Abstract

In this paper, I construct a new test of conditional moment inequalities, 
which is based on studentized kernel estimates of moment functions with many 
different values of the bandwidth parameter. The test automatically adapts 
to the unknown smoothness of moment functions and has uniformly correct 
asymptotic size. The test has high power in a large class of models with conditional
moment inequalities. Some existing tests have nontrivial power against
$n^{-1/2}$-local alternatives in a certain class of these models whereas my method
only allows for nontrivial testing against $(n/\log n)^{-1/2}$-local alternatives in this
class. There exist, however, other classes of models with conditional moment 
inequalities where the mentioned tests have much lower power in comparison 
with the test developed in this paper. 

Keywords: Conditional Moment Inequalities, Minimax Rate Optimality. 



1 Introduction 

Conditional moment inequalities (CMI) are often encountered both in economics and
econometrics. In economics, they arise naturally in many models that include
behavioral choice, see Pakes (2010) for a survey. In these models, an agent chooses
the action that maximizes expected utility given her information set. Comparing the
realized action with any other available action leads to CMI. In econometrics, they
appear in estimation problems with interval data and problems with censoring,
e.g., see Manski and Tamer (2002). In addition, CMI offer a convenient way to study
treatment effects in randomized experiments as described in Lee et al. (2011). In the
next section, I provide three detailed examples of models with CMI.



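Let $m: \mathbb{R}^d \times \mathbb{R}^k \times \Theta \to \mathbb{R}^p$ be a vector-valued known function. Let $(X, W)$ be a
pair of $\mathbb{R}^d$- and $\mathbb{R}^k$-valued random vectors, and $\theta \in \Theta$ a parameter. The CMI can be
written as

$$E[m(X, W, \theta) \mid X] \le 0 \quad \text{a.s.} \tag{1.1}$$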



where inequalities are understood componentwise. I am interested in testing the null
hypothesis, $H_0$, that $\theta = \theta_0$ against the alternative, $H_a$, that $\theta \ne \theta_0$ based on an iid sample
$(X_i, W_i)_{i=1}^n$ from the distribution of $(X, W)$. Note that I also allow for conditional
moment equalities since they can be written as pairs of the CMI in model (1.1).

Using CMI for inference is difficult because often these inequalities do not identify 
the parameter. Let 



$$\Theta_I = \{\theta \in \Theta : E[m(X, W, \theta) \mid X] \le 0 \ \text{a.s.}\} \tag{1.2}$$

denote the identified set. The model is said to be identified if and only if $\Theta_I$ is
a singleton. Otherwise, CMI do not identify the parameter $\theta$. For example, the
latter may happen when the CMI arise from a game-theoretic model with multiple 
equilibria. Moreover, the parameter may be weakly identified. My approach leads to 
a test with the correct asymptotic size no matter whether the parameter is identified, 
weakly identified, or not identified. 

Two approaches to robust CMI testing have been developed in the literature.
One approach (Andrews and Shi (2010)) is based on converting CMI into an infinite
number of unconditional moment inequalities using nonnegative weighting functions.
The other approach (Chernozhukov et al. (2009)) is based on estimating moment
functions nonparametrically. My method is inspired by the work of Andrews and Shi
(2010). To motivate the test developed in this paper, consider two examples of CMI
models. These models are highly stylized but convey the main ideas. In the first model,
$m$ is multiplicatively separable in $\theta$, i.e. $m(X, W, \theta) = \theta \tilde m(X, W)$ for some
$\tilde m : \mathbb{R}^d \times \mathbb{R}^k \to \mathbb{R}$ and $\theta \in \mathbb{R}$ with $E[\tilde m(X, W) \mid X] > 0$ almost surely. In the second model,
$m$ is additively separable in $\theta$, i.e. $m(X, W, \theta) = \tilde m(X, W) + \theta$. The identified sets,
$\Theta_I$, in these models are $\{\theta \in \mathbb{R} : \theta \le 0\}$ and $\{\theta \in \mathbb{R} : \theta \le -\operatorname{ess\,sup}_X E[\tilde m(X, W) \mid X]\}$
correspondingly. Andrews and Shi (2010) developed a test that has nontrivial power
against alternatives of the form $\theta_0 = \theta_{0,n} = C/\sqrt{n}$ for any $C > 0$ in the first model, so
their test has extremely high power in this model. It follows from Armstrong (2011a)
that their test has low power in the second model, however (e.g., in comparison with
the test of Chernozhukov et al. (2009)). In contrast, I construct a test that has
high power in a large class of CMI models including models like that in the second
example. At the same time, my test has virtually the same power in models like that
described in the first example. The main difference between the two approaches is that
my test statistic is based on the studentized estimates of moments whereas theirs is
not. More precisely, Andrews and Shi (2010) also consider studentization but they
modify the variance term so that asymptotic power properties of their test are similar
to those of the test with no studentization.
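
The identified sets in the two stylized models follow directly from the CMI. The short derivation below is only a sketch; it writes $\tilde m$ for the part of the moment function that does not depend on $\theta$ and uses nothing beyond the separability assumptions stated above.

```latex
% First model: m(X,W,\theta) = \theta \tilde m(X,W), with E[\tilde m(X,W) | X] > 0 a.s.
E[m(X,W,\theta) \mid X] = \theta \, E[\tilde m(X,W) \mid X] \le 0 \ \text{a.s.}
\iff \theta \le 0 .

% Second model: m(X,W,\theta) = \tilde m(X,W) + \theta.
E[m(X,W,\theta) \mid X] = E[\tilde m(X,W) \mid X] + \theta \le 0 \ \text{a.s.}
\iff \theta \le -\operatorname*{ess\,sup}_X E[\tilde m(X,W) \mid X] .
```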

The test of Chernozhukov et al. (2009) also has high power in a large class of
CMI models but it requires knowledge of certain smoothness properties of moment
functions such as order of differentiability whereas the test developed in this paper
does not. Moreover, my test automatically adapts to these smoothness properties,
selecting the most appropriate weighting function. This feature of the test is important
because smoothness properties of moment functions are rarely known in practice. For 
this reason, I call the test adaptive. 

The test statistic in this paper is based on kernel estimates of moment functions
$E[m_j(X, W, \theta_0) \mid X]$ with many bandwidth values using positive kernels.² Here
$m_j(X, W, \theta)$ denotes the $j$-th component of $m(X, W, \theta)$. I assume that the set of
bandwidth values expands as the sample size $n$ increases so that the minimal bandwidth
value converges to zero at an appropriate rate while the maximal one is fixed. Since
the variance of the kernel estimators varies greatly with the bandwidth value, each
estimator is studentized, i.e. it is divided by its estimated standard deviation. The
test statistic, $T$, is formed as the maximum of these studentized estimates, and large
values of $T$ suggest that the null hypothesis is violated.

¹ Andrews and Shi (2010) developed tests based on both Cramer-von Mises and
Kolmogorov-Smirnov test statistics. In this paper, I mainly refer to their test with the
Kolmogorov-Smirnov test statistic. Most statements are applicable to the Cramer-von
Mises test statistic as well, however.

² A kernel is said to be positive if the kernel function is positive on its support.

I develop a bootstrap method to simulate the critical value for the test. The
method is based on the observation that the distribution of the test statistic,
conditionally on the values $\{X_i\}_{i=1}^n$, is asymptotically independent of the distribution of the
noise $\{m(X_i, W_i, \theta_0) - E[m(X_i, W_i, \theta_0) \mid X_i]\}_{i=1}^n$ apart from its second moment. For
reasons similar to those discussed in Chernozhukov et al. (2007) and Andrews and Soares
(2010), the distribution of the test statistic in large samples depends heavily on the
extent to which CMI are binding. Moreover, the parameters that measure to what
extent CMI are binding cannot be estimated consistently. I develop a new approach
to deal with this problem, which I refer to as the refined moment selection (RMS)
procedure. The approach is based on a pretest that is used to decide which
counterparts of the test statistic should be used in simulating the critical value for the test.
In comparison with Andrews and Shi (2010), I use a model-specific critical value for
the pretest, which is simulated as a high quantile of the appropriate distribution,
whereas they use a deterministic threshold with no reference to the model. For
comparison, I also provide a plug-in critical value for the test. My proof of the
bootstrap validity is interesting in its own right because it is not known whether the
test statistic converges in distribution.

None of the tests in the literature including mine have power against alternatives
in the set $\Theta_I$. Therefore, I consider the alternatives of the form

$$P\{E[m_j(X, W, \theta_0) \mid X] > 0\} > 0 \ \text{for some } j = 1, \dots, p \tag{1.3}$$

To show that my test has good power properties in a large class of CMI models, I
derive its power against alternatives of the form (1.3) assuming that $E[m(X, W, \theta_0) \mid X]$ is
some vector of unrestricted nonparametric functions. In other words, I consider
nonparametric classes of alternatives. Once $m(X, W, \theta)$ is specified, it is straightforward
to translate my results into the parametric setting. The test developed in this paper
is consistent against any fixed alternative outside of the set $\Theta_I$. I also show that
my method allows for nontrivial testing against $(n/\log n)^{-1/2}$-local one-directional
alternatives.³ Finally, I prove that the test is minimax rate optimal against certain
classes of smooth alternatives consisting of moment functions $E[m(X, W, \theta_0) \mid X]$ that
are sufficiently flat at the points of maxima. Minimax rate optimality means that the
test is uniformly consistent against alternatives in the mentioned class whose distance
from the set of models satisfying (1.1) converges to zero at the fastest possible rate.
The requirement that functions should be sufficiently flat cannot be dropped because
the test is based on the positive kernels.

The literature concerned with unconditional and conditional moment inequalities
includes Chernozhukov et al. (2007), Romano and Shaikh (2008), Rosen (2008),
Andrews and Guggenberger (2009), Andrews and Han (2009), Andrews and Soares (2010),
Bugni (2010), Canay (2010), Pakes (2010), and Romano and Shaikh (2010).
I note that there is also a large literature on partial identification which is closely related
to that on moment inequalities. Methods specific for conditional moment inequalities
were developed in Khan and Tamer (2009), Kim (2008), Chernozhukov et al. (2009),
Andrews and Shi (2010), Lee et al. (2011), Armstrong (2011a), and Armstrong (2011b).
The case of CMI that point identify $\theta$ is treated in Khan and Tamer (2009). The test
of Kim (2008) is closely related to that of Andrews and Shi (2010). Lee et al. (2011)
developed a test based on the minimum distance statistic in the one-sided $L_p$-norm
and kernel estimates of moment functions. The advantage of their approach comes
from simplicity of their critical value for the test, which is an appropriate quantile of
the standard Gaussian distribution. Their test is not adaptive, however, since only
one bandwidth value is used. Armstrong (2011a) developed a new method for
computing the critical value for the test statistic of Andrews and Shi (2010) which leads
to a more powerful test than theirs but his method is not robust. In particular, his
method cannot be used in the CMI models like that described in the first example
above. Armstrong (2011b) considered the test statistic similar to that used in this
paper but he focused on estimation rather than inference.

Finally, an important related paper in the statistical literature is Dümbgen and Spokoiny
(2001). They consider testing qualitative hypotheses in the ideal Gaussian white noise
model where a researcher observes a stochastic process that can be represented as a
sum of the mean function and a Brownian motion. In particular, they developed a
test for the null hypothesis that the mean function is (weakly) negative almost
everywhere. Even though their test statistic is somewhat related to that used in this
paper, the technical details of their analysis are quite different.

³ In this paper, by one-directional alternatives, I mean alternatives of the form
$E[m(X, W, \theta_0) \mid X] = a_n f(X)$ for some sequence of positive numbers $\{a_n\}_{n=1}^\infty$
converging to zero where $f$ satisfies (1.3).

The rest of the paper is organized as follows. The next section elaborates on some
examples of CMI models. Section 3 formally introduces the test. The main results of
the paper are presented in section 4. A Monte Carlo simulation study is described in
section 6. There I provide an example of an alternative with a well-behaved moment
function such that the test developed in this paper rejects the null hypothesis with
probability higher than 80% while the rejection probability of all competing tests
does not exceed 20%. Brief conclusions are drawn in section 7. Finally, all proofs are
contained in the Appendix.



2 Examples 



In this section, I provide three examples where CMI arise naturally in economic and 
econometric models. The first two examples have function-valued parameters. In 
order to fit these examples into my framework, one can consider parametric approxi- 
mations of corresponding functions. 



Incomplete Models of English Auctions. My first example follows the Haile and Tamer
(2003) treatment of English auctions under weak conditions. The popular model of
English auctions suggested by Milgrom and Weber (1982) assumes that each bidder
is holding down the button while the price is going up continuously until she wants
to drop out. The price at the moment of dropping out is her bid. In this model, it
is well-known that the dominant strategy is to make a bid equal to her valuation of
the object. In practice, participants usually call out bids, however. So, the price rises
in jumps, and the bid may not be equal to a person's valuation of the object. In this
situation, the relation between bids and valuations of the object depends crucially on
the modeling assumptions. Haile and Tamer (2003) derived certain bounds on the
distribution function of valuations based on minimal assumptions of rationality.

Suppose we have an auction with $m$ bidders whose valuations of the object are
drawn independently from the distribution $F(\cdot, X)$ where $X$ denotes observable
characteristics of the object. Let $b_1, \dots, b_m$ denote the highest bids of each bidder. Let
$b_{1:m} \le \dots \le b_{m:m}$ denote the ordered sequence of bids $b_1, \dots, b_m$. Assuming that
bids do not exceed bidders' valuations, Haile and Tamer (2003) derived the following
upper bound on $F(\cdot, X)$:

$$E[I\{b_{i:m} \le v\} - \phi(F(v, X)) \mid X] \ge 0 \quad \text{a.s.} \tag{2.1}$$

for all $v \in \mathbb{R}$ and $i = 1, \dots, m$ where $\phi(\cdot)$ is a certain (known) function, see equation
(3) in Haile and Tamer (2003). A similar lower bound follows from the assumption that
bidders do not allow opponents to win at a price they would like to beat. Assuming we
observe an iid sequence of auctions, these CMI can be used for inference on $F(v, X)$.

Interval Data. In some cases, especially when data concerns personal informa- 
tion like individual income or wealth, one has to deal with interval data. Suppose we 
have a mean regression model 



$$Y = f(X, V) + \varepsilon \tag{2.2}$$

where $E[\varepsilon \mid X, V] = 0$ a.s. and $V$ is a scalar random variable. Suppose that we observe
$X$ and $Y$ but we do not observe $V$. Instead, we observe $V_0$ and $V_1$ called brackets such
that $V \in (V_0, V_1)$ a.s. In empirical analysis, brackets may arise because a respondent
refuses to provide information on $V$ but provides an interval to which $V$ belongs.
Following Manski and Tamer (2002), assume that $f(X, V)$ is weakly increasing in $V$
and $E[Y \mid X, V] = E[Y \mid X, V, V_0, V_1]$. Then it is easy to see that

$$E[I\{V_1 \le v\}(Y - f(X, v)) \mid X, V_0, V_1] \le 0 \tag{2.3}$$

and

$$E[I\{V_0 \ge v\}(Y - f(X, v)) \mid X, V_0, V_1] \ge 0 \tag{2.4}$$

for all $v \in \mathbb{R}$. If we observe an iid sample from the model, we can use these CMI for
inference on $f(X, V)$.

Treatment Effects. Suppose we have a randomized experiment where one group
of people gets a new treatment while the control group gets a placebo. Let $D = 1$ if the
person gets the treatment and $0$ otherwise. Let $p$ denote the probability that $D = 1$.
Let $X$ denote the person's observable characteristics and $Y$ denote a realized outcome.
Finally, let $Y_0$ and $Y_1$ denote counterfactual outcomes had the person received a
placebo or the new medicine respectively. Then $Y = DY_1 + (1 - D)Y_0$. The question
of interest is whether the new medicine has a positive expected impact uniformly over
all possible values of the person's characteristics $X$. In other words, the null hypothesis, $H_0$, is that

$$E[Y_1 - Y_0 \mid X] \ge 0 \quad \text{a.s.} \tag{2.5}$$

Since in randomized experiments $D$ is independent of $X$, Lee et al. (2011) showed
that

$$E[Y_1 - Y_0 \mid X] = E[DY/p - (1 - D)Y/(1 - p) \mid X] \tag{2.6}$$

Combining (2.5) and (2.6) gives CMI.


3 The Test 

In this section, I present the test statistic and give two bootstrap methods to simulate
a critical value. Given the nonparametric nature of the test, I use the corresponding
terminology. For fixed $\theta_0$, let $Y = m(X, W, \theta_0)$, $f(X) = E[m(X, W, \theta_0) \mid X]$, and
$\varepsilon = Y - f(X)$ so that $E[\varepsilon \mid X] = 0$ a.s. Then under the null hypothesis,

$$f(X) \le 0 \quad \text{a.s.} \tag{3.1}$$

I refer to $Y$ as a response variable, $f$ as a vector-valued regression function, $X$ as a
design point, and $\varepsilon$ as a disturbance. Components of $f$ are denoted by $f_1, \dots, f_p$.

The analysis in this paper is conducted conditionally on the set of values $\{X_i\}_{i=1}^n$
of the instrumental variable $X$, so all probabilistic statements in this paper should be
understood conditionally on $\{X_i\}_{i=1}^n$ for almost all sequences $\{X_i\}_{i=1}^n$. Lemma 4 in
the Appendix provides certain conditions that ensure that assumptions used in this
paper hold for almost all sequences $\{X_i\}_{i=1}^n$.

Section 3.1 defines the test statistic assuming that $E[\varepsilon_i \varepsilon_i^T] = \Sigma_i$ is known for each
$i = 1, \dots, n$. Section 3.2 gives two bootstrap methods to simulate a critical value.
The first one is based on plug-in asymptotics, and the second one is based on the
refined moment selection (RMS) procedure. Section 3.2 also provides some intuition
of why these procedures lead to the correct asymptotic size of the test. When $\Sigma_i$ is
not known, it should be estimated from the data. Section 3.3 shows how to construct
an appropriate estimator $\hat\Sigma_i$ of $\Sigma_i$. The feasible version of the test will be based on
substituting $\hat\Sigma_i$ for $\Sigma_i$ both in the test statistic and in the critical value. Section 3.4 provides
some notes on how to choose certain tuning parameters.



3.1 The Test Statistic 

The test statistic in this paper is based on the kernel estimator of the vector-valued
regression function $f$. Let $K : \mathbb{R}^d \to \mathbb{R}_+$ be some kernel. For bandwidth value
$h \in \mathbb{R}_+$, denote $K_h(x) = K(x/h)/h^d$. For each pair of observations $i, j = 1, \dots, n$,
denote the weight function

$$w_h(X_i, X_j) = \frac{K_h(X_j - X_i)}{\sum_{k=1}^n K_h(X_k - X_i)} \tag{3.2}$$

Then the kernel estimator of $f_m(X_i)$ is

$$\hat f_{i,m,h} = \sum_{j=1}^n w_h(X_i, X_j) Y_{j,m} \tag{3.3}$$

where $Y_{j,m}$ denotes the $m$-th component of response variable $Y_j$. Conditionally on $\{X_i\}_{i=1}^n$,
the variance of the kernel estimator $\hat f_{i,m,h}$ is

$$V_{i,m,h}^2 = \sum_{j=1}^n w_h^2(X_i, X_j) \Sigma_{j,mm} \tag{3.4}$$

where $\Sigma_{j,m_1 m_2}$ denotes the $(m_1, m_2)$ component of $\Sigma_j = E[\varepsilon_j \varepsilon_j^T]$.

Next, consider a finite set of bandwidth values $H = \{h = h_{\max} a^k : h \ge h_{\min}, k = 0, 1, 2, \dots\}$
for some $h_{\max} > h_{\min}$ and $a \in (0, 1)$. For simplicity, I assume that $h_{\min} =
h_{\max} a^k$ for some $k \in \mathbb{N}$ so that $h_{\min}$ is included in $H$. I assume that as the sample
size $n$ increases, $h_{\min}$ converges to zero while $h_{\max}$ is fixed. For each bandwidth
value $h \in H$, choose a subset $I_h$ of observations such that $\|X_i - X_j\| > 2h$ for all
$i, j \in I_h$ with $i \ne j$ and for each $i = 1, \dots, n$, there exists an element $j(i) \in I_h$ such
that $\|X_i - X_{j(i)}\| \le 2h$ where $\|\cdot\|$ denotes the Euclidean norm on $\mathbb{R}^d$. I refer to
$I_h$ as a set of test points. The choice of $I_h$ may be random, but it is important to
select $I_h$ independently of response variables $\{Y_i\}_{i=1}^n$. So, conditionally on $\{X_i\}_{i=1}^n$, I
assume that $I_h$ is nonstochastic. It will be assumed in the next section that $K(x) = 0$
for any $x \in \mathbb{R}^d$ such that $\|x\| > 1$. Thus, random variables $\{\hat f_{i,m,h}\}_{i \in I_h}$ are jointly
independent for any fixed $m = 1, \dots, p$ and $h \in H$ conditionally on $\{X_i\}_{i=1}^n$. This fact
will play a key role in the derivation of the lower bound on the growth rate of the
pdf of the test statistic, which is used in the analysis of size properties of the test.⁴
Finally, denote $S = \{(i, m, h) : h \in H, i \in I_h, m = 1, \dots, p\}$.
Based on this notation, the test statistic is

$$T = \max_{s \in S} \frac{\hat f_s}{V_s} \tag{3.5}$$
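
The bandwidth grid and the sets of test points described above can be built in many ways. The sketch below shows one possibility under stated assumptions: the greedy scan is my own illustrative choice rather than a rule prescribed in the text, and the function names are hypothetical. It uses only the design points, so the selection is independent of the responses.

```python
import math


def bandwidth_grid(h_max, h_min, a):
    """Return [h_max * a**k for k = 0, 1, 2, ...] restricted to h >= h_min."""
    grid = []
    h = h_max
    while h >= h_min * (1.0 - 1e-12):   # small tolerance so h_min itself is kept
        grid.append(h)
        h *= a
    return grid


def greedy_test_points(points, h):
    """Select indices whose points are pairwise more than 2h apart.

    A point is skipped only when an already selected point lies within 2h
    of it, so every observation ends up within 2h of some selected point.
    """
    selected = []
    for i, x_i in enumerate(points):
        if all(math.dist(x_i, points[j]) > 2.0 * h for j in selected):
            selected.append(i)
    return selected


def is_valid_test_set(points, selected, h):
    """Check the two defining properties of a set of test points."""
    separated = all(
        math.dist(points[i], points[j]) > 2.0 * h
        for i in selected for j in selected if i != j
    )
    covering = all(
        any(math.dist(x, points[j]) <= 2.0 * h for j in selected)
        for x in points
    )
    return separated and covering
```

Here `points` is a list of coordinate tuples, for example `[(0.1, 0.4), (0.7, 0.2)]` when $d = 2$.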



Let me now explain why the optimal bandwidth value depends on the smoothness
properties of the components $f_1, \dots, f_p$ of $f$. Without loss of generality, consider $j = 1$.
Suppose that $f_1(X)$ is flat. Then $f_1(X)$ is positive on a large subset of its domain
whenever its maximal value is positive. Hence, the maximum in $T$ will correspond to
a large bandwidth value because the variance of the kernel estimator, which enters
the denominator of the test statistic, decreases with the bandwidth value. On the
other hand, if $f_1(X)$ is allowed to have peaks, then there may not exist a large subset
where it is positive. So, large bandwidth values may not yield large values of $T$, and
small bandwidth values should be used in such cases. I circumvent these problems by
considering the set of bandwidth values jointly, and let the data determine the best
bandwidth value. In this sense, my test adapts to the smoothness properties of $f(X)$.
This allows me to construct a test with good uniform power properties over possible
smoothness of $f(X)$.

When $\Sigma_j$ is not observed, which is usually the case in practice, one can define
$\hat V_{i,m,h}^2 = \sum_{j=1}^n w_h^2(X_i, X_j) \hat\Sigma_{j,mm}$ and use

$$\hat T = \max_{s \in S} \frac{\hat f_s}{\hat V_s} \tag{3.6}$$

instead of $T$ where $\hat\Sigma_j$ is some estimator of $\Sigma_j$. Some possible estimators are discussed
in section 3.3.

⁴ Although my argument in the derivation of the lower bound is based on the fact that
$\{\hat f_{i,m,h}\}_{i \in I_h}$ are jointly independent, I believe that the same lower bound can be obtained
even for the case $I_h = \{1, \dots, n\}$. If this statement is true, one can use $I_h = \{1, \dots, n\}$ in
the definition of the test statistic.
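
To make the construction of the statistic concrete, here is a minimal sketch for the scalar case $d = 1$, $p = 1$. Several things are assumptions of the sketch rather than choices made in the text: the particular kernel shape, the greedy rule for picking test points, the normalization of the weights to sum to one, and all function names. The list `sigma2` holds either the known variances or their estimates.

```python
import math


def kernel(u):
    """Positive kernel with support [-1, 1]; this particular shape is an assumed choice."""
    return max(0.0, 1.0 - u * u)


def weights(x, i, h):
    """Weights w_h(X_i, X_j), j = 1..n, normalized to sum to one."""
    raw = [kernel((x_j - x[i]) / h) for x_j in x]
    total = sum(raw)                       # positive because kernel(0) = 1
    return [r / total for r in raw]


def select_points(x, h):
    """Greedy choice of indices whose design points are pairwise more than 2h apart."""
    selected = []
    for i, x_i in enumerate(x):
        if all(abs(x_i - x[j]) > 2.0 * h for j in selected):
            selected.append(i)
    return selected


def studentized_estimates(x, y, sigma2, grid):
    """Map (i, h) to f_hat_{i,h} / V_{i,h} for h in the grid and i in I_h."""
    stats = {}
    for h in grid:
        for i in select_points(x, h):
            w = weights(x, i, h)
            f_hat = sum(w_j * y_j for w_j, y_j in zip(w, y))
            v = math.sqrt(sum(w_j * w_j * s_j for w_j, s_j in zip(w, sigma2)))
            stats[(i, h)] = f_hat / v
    return stats


def max_statistic(x, y, sigma2, grid):
    """T = max over s in S of f_hat_s / V_s."""
    return max(studentized_estimates(x, y, sigma2, grid).values())
```

For instance, `max_statistic(x, y, sigma2, [0.5, 0.25, 0.125])` evaluates the statistic on a three-point bandwidth grid; large values point against the null hypothesis.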

3.2 Critical Values 

Suppose we want to construct a test of size $\alpha$. This subsection explains how to
simulate a critical value $t_{1-\alpha}$ for the statistic $T$ based on two bootstrap methods. One
method is based on the plug-in asymptotics, and the other one is based on the refined
moment selection (RMS) procedure. Both methods have deterministic and randomized
versions. For the randomized versions, one first determines some small interval, say
$[c, c + \beta]$ with $\beta > 0$, where the critical value belongs. Then one draws the critical
value from a certain distribution with the support $[c, c + \beta]$. This randomization
comes from my proof technique, which is based on the Lindeberg method. Under
somewhat stronger conditions, I also prove the validity of both methods with $\beta = 0$,
which corresponds to their deterministic versions. The test will be of the following
form: reject the null hypothesis if and only if $T > t_{1-\alpha}$.

Let $\beta$ be either zero or some small positive number. Let $g_0$ be a thrice differentiable
function from $\mathbb{R}$ into $[0, 1]$ such that $g_0(x) = 1$ for all $x \le 0$ and $g_0(x) = 0$ for all
$x \ge 1$. Denote $g(x) = g_0((x - c)/\beta)$ for some $c \in \mathbb{R}$. Since $g(x) \in [0, 1]$ for all $x \in \mathbb{R}$,
$g(\cdot)$ gives a randomized test: upon observing the test statistic $T = x$, one accepts
the null hypothesis with probability $g(x)$. I will choose $c$ so that, under the null
hypothesis, $E[g(T)] \ge 1 - \alpha + o(1)$ as $n \to \infty$, which leads to the correct asymptotic
size of this randomized test. An equivalent way to describe this test is as follows. Let
$U$ be a random variable independent of the data with uniform distribution on $[0, 1]$.
Define the critical value $t_{1-\alpha}$ for the test from the equation $g(t_{1-\alpha}) = U$. Since $g(x)$
is decreasing in $x$, this equation has a unique solution so that $t_{1-\alpha}$ is well-defined.
Lemma 1 in the Appendix shows that $E[g(T)] = P\{T \le t_{1-\alpha}\}$, which means that the
randomized test is equivalent to the test based on the critical value $t_{1-\alpha}$. Note that
the latter formulation is more convenient for the confidence set construction: one can
use the same $U$ for all possible values of $\theta_0$. For the purposes of presentation, the
former formulation is suitable, however. I refer to $g(\cdot)$ as a test function.

Let me now describe two possible bootstrap methods to simulate $c$. The first
method is based on plug-in asymptotics. It relies on two observations. First, it is
easy to see that, for a fixed distribution of disturbances $\{\varepsilon_i\}_{i=1}^n$, the maximum of the $1 - \alpha$
quantile of the test statistic $T$ over all possible functions $f$ satisfying $f \le 0$ almost
surely corresponds to $f = 0_p$. Second, lemmas 9 and 11 in the Appendix show that
the distribution of the statistic $T$ is asymptotically independent of the distribution
of disturbances $\{\varepsilon_i : i = 1, \dots, n\}$ apart from their second moments $\{\Sigma_i : i = 1, \dots, n\}$.
These observations suggest that one can simulate $c$ by the following procedure:

1. For each $i = 1, \dots, n$, simulate $\tilde Y_i \sim N(0_p, \Sigma_i)$ independently across $i$.

2. Calculate $T^{PIA} = \max_{(i,m,h) \in S} \sum_{j=1}^n w_h(X_i, X_j) \tilde Y_{j,m} / V_{i,m,h}$.

3. Repeat steps 1 and 2 independently $B$ times for some large $B$ to obtain $\{T_b^{PIA} :
b = 1, \dots, B\}$.

4. Find $c_{1-\alpha}^{PIA}$ such that $\sum_{b=1}^B g_0((T_b^{PIA} - c_{1-\alpha}^{PIA})/\beta)/B = 1 - \alpha$.

Then the plug-in test function $g_{1-\alpha}^{PIA} : \mathbb{R} \to [0, 1]$ is given by $g_{1-\alpha}^{PIA}(x) = g_0((x - c_{1-\alpha}^{PIA})/\beta)$
for all $x \in \mathbb{R}$.
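
The four steps above can be sketched as follows for scalar responses ($p = 1$). This is an illustration under stated assumptions, not the implementation used in the paper: each element $s \in S$ is represented by a hypothetical pair `(w_s, v_s)` holding its weight vector and its standard deviation $V_s$, and the sketch treats the deterministic case $\beta = 0$ by reading step 4 as "take the empirical $1 - \alpha$ quantile", which is my interpretation of the limit of the smoothed equation.

```python
import math
import random


def simulated_max(rows, sigma2, rng):
    """One draw of max_s sum_j w_s[j] * Ytilde_j / V_s with Ytilde_j ~ N(0, sigma2[j])."""
    y_tilde = [rng.gauss(0.0, math.sqrt(s)) for s in sigma2]
    return max(
        sum(w_j * y_j for w_j, y_j in zip(w_s, y_tilde)) / v_s
        for w_s, v_s in rows
    )


def plug_in_critical_value(rows, sigma2, alpha, n_boot=1000, seed=0):
    """Empirical (1 - alpha) quantile of the simulated statistic (beta = 0 case)."""
    rng = random.Random(seed)
    draws = sorted(simulated_max(rows, sigma2, rng) for _ in range(n_boot))
    k = math.ceil((1.0 - alpha) * n_boot) - 1   # index of the empirical quantile
    k = min(max(k, 0), n_boot - 1)              # guard against extreme alpha
    return draws[k]
```

With a single element in $S$ and unit variance, the returned value should be close to the corresponding standard normal quantile, which gives a quick sanity check of the simulation.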

The second method is based on the refined moment selection (RMS) procedure.
It gives a less conservative critical value while maintaining the required size of the
test. The method is based on the observation that $|T| = O_p(\sqrt{\log n})$ if $f = 0_p$ (see
the lemmas in the Appendix, including lemmas 9 and 11) while $\hat f_{i,m,h}/V_{i,m,h} \to -\infty$ with a polynomial rate
if $f_m(X_i) < 0$ and $h \to 0$. Such terms will have asymptotically negligible effect on
the distribution of $T$, so we can ignore corresponding terms in the simulated statistic.
Specifically, let $\gamma < \alpha/2$ be some small positive number. First, use the plug-in
bootstrap to find $c_{1-\gamma}^{PIA}$. Denote

$$S^{RMS} = \{s \in S : \hat f_s / V_s > -2(c_{1-\gamma}^{PIA} + \beta)\} \tag{3.7}$$

Second, run the following procedure:

1. For each $i = 1, \dots, n$, simulate $\tilde Y_i \sim N(0_p, \Sigma_i)$ independently across $i$.

2. Calculate $T^{RMS} = \max_{(i,m,h) \in S^{RMS}} \sum_{j=1}^n w_h(X_i, X_j) \tilde Y_{j,m} / V_{i,m,h}$.

3. Repeat steps 1 and 2 independently $B$ times for some large $B$ to obtain $\{T_b^{RMS} :
b = 1, \dots, B\}$.

4. Find $c_{1-\alpha+2\gamma}^{RMS}$ such that $\sum_{b=1}^B g_0((T_b^{RMS} - c_{1-\alpha+2\gamma}^{RMS})/\beta)/B = 1 - \alpha + 2\gamma$.

Then the RMS test function $g_{1-\alpha}^{RMS}$ is given by $g_{1-\alpha}^{RMS}(x) = g_0((x - c_{1-\alpha+2\gamma}^{RMS})/\beta)$
for all $x \in \mathbb{R}$. The additional term $2\gamma$ can be interpreted as a correction for the
truncation procedure introduced in $S^{RMS}$.
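
A compact sketch of the two-stage RMS recipe, again for scalar responses, is given below. The representation of $S$ as a list of hypothetical `(w_s, v_s)` pairs, the use of empirical quantiles in place of the smoothed equation in step 4, and the fallback to all of $S$ when no term survives the selection are assumptions of this sketch and are not taken from the text.

```python
import math
import random


def simulated_quantile(rows, sigma2, level, n_boot, rng):
    """Empirical `level` quantile of max_s sum_j w_s[j] * Ytilde_j / V_s."""
    draws = []
    for _ in range(n_boot):
        y_tilde = [rng.gauss(0.0, math.sqrt(s)) for s in sigma2]
        draws.append(max(
            sum(w_j * y_j for w_j, y_j in zip(w_s, y_tilde)) / v_s
            for w_s, v_s in rows
        ))
    draws.sort()
    k = min(max(math.ceil(level * n_boot) - 1, 0), n_boot - 1)
    return draws[k]


def rms_critical_value(rows, stats, sigma2, alpha, gamma, beta=0.0,
                       n_boot=1000, seed=0):
    """Return (c_rms, kept indices) following the two-stage RMS recipe.

    rows[k] = (w_s, v_s) and stats[k] = f_hat_s / V_s refer to the same s.
    """
    rng = random.Random(seed)
    # Stage 1: plug-in critical value at level 1 - gamma using all of S.
    c_pia = simulated_quantile(rows, sigma2, 1.0 - gamma, n_boot, rng)
    # Moment selection as in (3.7): keep s whose studentized estimate is not too negative.
    kept = [k for k, t in enumerate(stats) if t > -2.0 * (c_pia + beta)]
    # Assumption of this sketch: if nothing is kept, fall back to all of S.
    selected = [rows[k] for k in kept] if kept else rows
    # Stage 2: quantile at level 1 - alpha + 2 * gamma over the selected terms only.
    c_rms = simulated_quantile(selected, sigma2, 1.0 - alpha + 2.0 * gamma,
                               n_boot, rng)
    return c_rms, kept
```

Dropping terms with very negative studentized estimates shrinks the set over which the maximum is taken, which is why the resulting critical value is typically smaller than the plug-in one.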



3.3 Estimating $\Sigma_i$

Let me now explain how one can estimate $\Sigma_i$. The literature on estimating $\Sigma_i$ is
huge. Among other papers, it includes Rice (1984), Müller and Stadtmüller (1987),
Härdle and Tsybakov (1997), and Fan and Yao (1998). For scalar-valued response
variables, a variety of such estimators is described in Horowitz and Spokoiny (2001).
All those estimators can be immediately generalized to vector-valued response
variables. For completeness, I describe one estimator here. For $i = 1, \dots, n$ define $j(i)$ by
the following recursion:



i(i) 



arg min \\Xj 



and 



(3.8) 



= arg min \\Xi — X,- 



(3.9) 



Then variance $\Sigma_i$ can be estimated by

$$\hat\Sigma_i = \frac{\sum_{k=1}^n (Y_k - Y_{j(k)})(Y_k - Y_{j(k)})^T I\{\|X_k - X_i\| \le b_n\}}{2 \sum_{k=1}^n I\{\|X_k - X_i\| \le b_n\}} \tag{3.10}$$

where $b_n$ denotes some bandwidth value. This estimator will be uniformly consistent
for $\Sigma_i$ over $i = 1, \dots, n$ with rate $(\log n/n)^{1/(2+d)}$, i.e.

$$\max_{i=1,\dots,n} \|\hat\Sigma_i - \Sigma_i\| = O_p\left(\left(\frac{\log n}{n}\right)^{1/(2+d)}\right) \tag{3.11}$$

if (i) $b_n \asymp (\log n/n)^{1/(2+d)}$ and (ii) assumptions from section 4.1 hold where $\|\cdot\|$
denotes the spectral norm on the space of $p \times p$-dimensional symmetric matrices
corresponding to the Euclidean norm on $\mathbb{R}^p$. To choose bandwidth value $b_n$ in practice,
one can use any type of cross validation. An advantage of this estimator is that
it is fully adaptive with respect to smoothness properties of regression function $f$.

The intuition behind this estimator is based on the following argument. Note that
$j(k)$ is chosen so that $X_{j(k)}$ is close to $X_k$. If regression function $f$ is continuous,

$$Y_k - Y_{j(k)} = f(X_k) - f(X_{j(k)}) + \varepsilon_k - \varepsilon_{j(k)} \approx \varepsilon_k - \varepsilon_{j(k)} \tag{3.12}$$

so that

$$E[(Y_k - Y_{j(k)})(Y_k - Y_{j(k)})^T] \approx \Sigma_k + \Sigma_{j(k)} \tag{3.13}$$

since $\varepsilon_k$ is independent of $\varepsilon_{j(k)}$. If $b_n$ is small enough and $\Sigma(X)$ is continuous, $\Sigma_k +
\Sigma_{j(k)} \approx 2\Sigma_i$ since only $X_k$ satisfying $\|X_k - X_i\| \le b_n$ are used in estimating $\Sigma_i$.
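
The difference-based idea behind the variance estimator can be illustrated for scalar design points and scalar responses. In this sketch the partner $j(k)$ is simply taken to be the nearest other design point; that pairing rule is a simplifying assumption of the sketch and may differ from the recursion used in the paper. The data-generating process in `demo` is hypothetical.

```python
import math
import random


def nearest_other(x, k):
    """Index j != k whose design point is closest to x[k] (assumed pairing rule)."""
    return min((j for j in range(len(x)) if j != k),
               key=lambda j: abs(x[j] - x[k]))


def variance_estimates(x, y, b_n):
    """Local average of (Y_k - Y_{j(k)})^2 / 2 over design points within b_n of X_i."""
    n = len(x)
    sq_diff = [(y[k] - y[nearest_other(x, k)]) ** 2 for k in range(n)]
    estimates = []
    for i in range(n):
        window = [k for k in range(n) if abs(x[k] - x[i]) <= b_n]   # contains i itself
        estimates.append(sum(sq_diff[k] for k in window) / (2.0 * len(window)))
    return estimates


def demo(n=600, sigma=0.5, seed=0):
    """Average estimated variance for a smooth regression with noise variance sigma**2."""
    rng = random.Random(seed)
    x = [rng.random() for _ in range(n)]
    y = [math.sin(3.0 * x_i) + rng.gauss(0.0, sigma) for x_i in x]
    estimates = variance_estimates(x, y, b_n=0.2)
    return sum(estimates) / n
```

Because neighboring responses share almost the same regression value, their difference is driven by the two disturbances, and halving the squared difference recovers the variance on average without estimating $f$ first.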



3.4 Remarks on the Choice of Testing Parameters 

Implementing the deterministic version of the test requires choosing minimal and
maximal bandwidth values $h_{\min}$ and $h_{\max}$ and the parameter $\gamma$. The randomized
version of the test also uses the parameter $\beta$ and the function $g_0 : \mathbb{R} \to [0, 1]$. In this
section, I provide some notes on how to choose these objects for the randomized test
to make sure that the test maintains the required size.

First, I recommend setting $h_{\max} = \max_{i,j=1,\dots,n} \|X_i - X_j\|/2$ as a normalization.
Second, it follows from theorem 1 that the test with RMS test function is not
conservative asymptotically only if $\gamma = \gamma_n \to 0$ as $n \to \infty$. So, I recommend setting $\gamma$ as
a small fraction of $\alpha$, for example $\gamma = 0.01$ for $\alpha = 0.05$. Alternatively, one can set
$\gamma = 0.1/\log(n)$ similarly to the corresponding choice in Chernozhukov et al. (2009).

Next, consider how to choose $g_0$, $h_{\min}$, and $\beta$. It follows from theorems 1 and 6
and lemma 11 that the test maintains the required size if

A = glT^^" 3 + 3J jh + ^f 2 ) (IWUool°g (3.14) 

is small in comparison with $\alpha$ (required size) where

F = (max E[\elJ] + max v^E?^) " (3.15)

with both maxima taken over $i = 1, \dots, n$ and $m = 1, \dots, p$ and

$$b = \max_{(i,m,h) \in S;\ j = 1, \dots, n} \frac{w_h(X_i, X_j)}{V_{i,m,h}} \tag{3.16}$$



If $\beta \le 1$, a good choice of $g_0$ is given by

$$g_0(x) = \begin{cases}
1 & \text{if } x \le 0 \\
1 - (16/3)x^3 & \text{if } x \in (0, 1/4] \\
7/6 - x - 4(x - 1/4)^2 + (16/3)(x - 1/4)^3 & \text{if } x \in (1/4, 3/4] \\
(16/3)(1 - x)^3 & \text{if } x \in (3/4, 1] \\
0 & \text{if } x > 1
\end{cases} \tag{3.17}$$

This function is chosen so that $g_0'''(x) = -32$ for $x \in (0, 1/4]$, $+32$ for $x \in (1/4, 3/4]$,
and $-32$ for $x \in (3/4, 1]$. Given this function, if $\beta \le 1$, it is enough to set parameters
so that

$$1.8\, p\, b\, n^{1/3} (\log |S|)^{2/3} F / \beta^{5/3} < \alpha \tag{3.18}$$

Given $h_{\min}$, $b$ and $F$ can be estimated from the data. Then one can choose $\beta$ so that
the inequality above is satisfied. Note that there is a trade-off between choosing small
$\beta$ and small $h_{\min}$ since $b$ is a decreasing function of $h_{\min}$.
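
The piecewise cubic test function and the equation that defines the smoothed critical value for $\beta > 0$ can be sketched as follows. The piecewise formula mirrors the function described above (it is the unique reading consistent with the stated third derivatives and with the boundary values 1 and 0); the bisection solver and the names `smoothed_cdf` and `solve_critical_value` are hypothetical additions for illustration.

```python
def g0(x):
    """Piecewise cubic function equal to 1 for x <= 0 and 0 for x >= 1."""
    if x <= 0.0:
        return 1.0
    if x <= 0.25:
        return 1.0 - (16.0 / 3.0) * x ** 3
    if x <= 0.75:
        u = x - 0.25
        return 7.0 / 6.0 - x - 4.0 * u ** 2 + (16.0 / 3.0) * u ** 3
    if x <= 1.0:
        return (16.0 / 3.0) * (1.0 - x) ** 3
    return 0.0


def smoothed_cdf(draws, c, beta):
    """Average of g0((T_b - c) / beta) over simulated draws; requires beta > 0."""
    return sum(g0((t - c) / beta) for t in draws) / len(draws)


def solve_critical_value(draws, level, beta, tol=1e-10):
    """Find c with smoothed_cdf(draws, c, beta) = level by bisection.

    The average is nondecreasing in c, equals 0 at min(draws) - beta and
    equals 1 at max(draws), so a solution exists for any level in (0, 1).
    """
    lo, hi = min(draws) - beta, max(draws)
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if smoothed_cdf(draws, mid, beta) < level:
            lo = mid
        else:
            hi = mid
    return hi
```

When the solution is not unique because the smoothed average is flat, this solver returns the smallest value of $c$ that attains the level.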

I note that the inequality (3.18) guarantees good size properties of the test
uniformly over a large set of the true distributions of disturbances $\{\varepsilon_i\}_{i=1}^n$. In particular,
this set includes discrete distributions, which lead to the distributions of the test
statistic that are difficult to approximate using Gaussian disturbances.⁵ Therefore,
this inequality is difficult to satisfy in sample sizes typical for economic data.
Nevertheless, this inequality is still useful because it gives a starting point in choosing
testing parameters.

⁵ A similar phenomenon is also known in the classical theory of Central Limit Theorems, see
Ibragimov and Linnik (1971).



4 The Main Results



This section presents my main results. Section 4.1 gives regularity conditions. Section
4.2 describes size properties of the test. Section 4.3 explains the behavior of the test
under a fixed alternative. Section 4.4 derives the rate of consistency of the test against
one-directional alternatives mentioned in the introduction. Section 4.5 shows the rate
of uniform consistency against certain classes of smooth alternatives. Section 4.6
presents the minimax rate-optimality result.



4.1 Assumptions 

Let $M_h(X_i)$ be the number of elements in the set $\{X_j : \|X_j - X_i\| < h,\ j = 1, \dots, n\}$. In what follows, I will write $C$ and its variants for a generic constant whose value may vary depending on the context. Results in this paper will be proven under the following regularity assumptions.

Assumption 1. (i) Design points $\{X_i\}_{i=1}^{\infty}$ are nonstochastic. (ii) For some constant $0 < C < \infty$ and all $i = 1, \dots, n$, $\|X_i\| \le C$. (iii) For some constants $0 < C_1 < C_2 < \infty$, $C_1 n h^d \le M_h(X_i) \le C_2 n h^d$ for all $i \in \mathbb{N}$ and $h \in H = H_n$.
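Part (iii) of the assumption can be inspected on a given design. The following is a minimal sketch in Python, not taken from the paper: it counts neighbors within distance $h$ and reports the smallest and largest ratio $M_h(X_i)/(n h^d)$, which play the role of $C_1$ and $C_2$ for that particular design and bandwidth. The function names are hypothetical.

```python
import math


def neighbor_counts(points, h):
    """M_h(X_i): number of design points X_j with ||X_j - X_i|| < h (X_i itself included)."""
    return [sum(1 for q in points if math.dist(p, q) < h) for p in points]


def proportionality_range(points, h):
    """Smallest and largest value of M_h(X_i) / (n * h^d) over the design points."""
    n = len(points)
    d = len(points[0])
    scale = n * h**d
    counts = neighbor_counts(points, h)
    return min(counts) / scale, max(counts) / scale


if __name__ == "__main__":
    # Equally spaced scalar design on [0, 1]; boundary points have about half
    # as many neighbors as interior points.
    design = [(i / 100,) for i in range(101)]
    print(proportionality_range(design, 0.105))
```

For an equally spaced design the two ratios differ by roughly a factor of two because of the boundary, which is the kind of bounded variation the assumption allows.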

The design points are nonstochastic because the analysis is conducted conditionally on $\{X_i\}_{i=1}^{\infty}$. Assumption 1 also states that the design points have bounded support, which is a mild assumption. In addition, it states that the number of design points in certain neighborhoods of each design point is proportional to the volume of the neighborhood with the coefficient of proportionality bounded from above and away from zero. It is stated in Horowitz and Spokoiny (2001) that assumption 1 holds in an iid setting with probability approaching one as the sample size increases if the distribution of $X_i$ is absolutely continuous with respect to Lebesgue measure, has bounded support, and has the density bounded away from zero on the support. This statement is actually wrong unless one makes some extra assumptions. Lemma 3 in the Appendix gives a counter-example. Instead, lemma 4 shows that assumption 1 holds for large $n$ almost surely if, in addition, I assume that the density of $X_i$ is bounded from above, and that the support of $X_i$ is a convex set. Necessity of the density boundedness is obvious. Convexity of the support is not necessary for assumption 1 but it gives a good trade-off between generality and simplicity. In general, one should deal with some smoothness properties of the boundary of the support. Note that the statement "for large $n$ almost surely" is stronger than "with probability approaching one". Note also that assumption 1(iii) requires inequalities to hold for all $i \in \mathbb{N}$, not just for $i = 1, \dots, n$.

Assumption 2. (i) Disturbances $\{\varepsilon_i : i = 1, \dots, n\}$ are independent $\mathbb{R}^p$-valued random variables with $E[\varepsilon_{i,m_1}] = 0$, $E[\varepsilon_{i,m_1}\varepsilon_{i,m_2}] = \Sigma_{i,m_1 m_2} < \infty$, and $E[\varepsilon_{i,m_1}\varepsilon_{i,m_2}\varepsilon_{i,m_3}\varepsilon_{i,m_4}] = \Sigma_{i,m_1 m_2 m_3 m_4} < \infty$ for all $i = 1, \dots, n$ and $m_1, m_2, m_3, m_4 = 1, \dots, p$. (ii) For some constants $0 < C < \infty$ and $\delta > 0$, $E[|\varepsilon_{i,m}|^{4+\delta}] < C$ for all $i = 1, \dots, n$ and $m = 1, \dots, p$. (iii) For some constant $0 < C < \infty$, $|\Sigma_{i,m_1 m_2} - \Sigma_{j,m_1 m_2}| < C\|X_i - X_j\|$ and $|\Sigma_{i,m_1 m_2 m_3 m_4} - \Sigma_{j,m_1 m_2 m_3 m_4}| < C\|X_i - X_j\|$ for all $i, j = 1, \dots, n$ and $m_1, m_2, m_3, m_4 = 1, \dots, p$. (iv) For some constant $0 < C < \infty$, $\Sigma_{i,mm} > C$ for all $i = 1, \dots, n$ and $m = 1, \dots, p$.

The reason for imposing assumption 2 is threefold. First, a finite third moment of disturbances is used in the derivation of a certain invariance principle with the rate of convergence. As in the classical central limit theorem, two finite moments are sufficient to prove weak convergence but more finite moments are necessary if we are interested in the rate of convergence. Second, the finite $4 + \delta$ moment of disturbances and Lipschitz continuity properties are used to make sure that $\hat\Sigma_i$ converges in probability to $\Sigma_i$ uniformly over $i = 1, \dots, n$ for a particular estimator $\hat\Sigma_i$ of $\Sigma_i$ described in section 3.3 at an appropriate rate. Finally, I assume that the variance of each component of disturbances is bounded away from zero for simplicity of the presentation. Since I use a studentization of kernel estimators, without this assumption, it would be necessary to truncate the variance of the kernel estimators from below with truncation level slowly converging to zero. That would complicate the derivation of the main results without changing main ideas.

Before stating assumption 3, let me give formal definitions of the Hölder smoothness class $\mathcal{F}(\tau, L)$ and its subsets $\mathcal{F}_\varsigma(\tau, L)$. For a $d$-tuple of nonnegative integers $\alpha = (\alpha_1, \dots, \alpha_d)$ with $|\alpha| = \alpha_1 + \dots + \alpha_d$, function $g : \mathbb{R}^d \to \mathbb{R}$, and $x = (x_1, \dots, x_d) \in \mathbb{R}^d$, denote

$$D^\alpha g(x) = \frac{\partial^{|\alpha|} g}{\partial x_1^{\alpha_1} \cdots \partial x_d^{\alpha_d}}(x) \qquad (4.1)$$

whenever it exists. For $\tau > 0$, it is said that the function $g : \mathbb{R}^d \to \mathbb{R}$ belongs to the class $\mathcal{F}(\tau, L)$ if it has continuous partial derivatives up to order $[\tau]$ and for any $\alpha = (\alpha_1, \dots, \alpha_d)$ such that $|\alpha| = [\tau]$ and $x, y \in \mathbb{R}^d$,

$$|D^\alpha g(x) - D^\alpha g(y)| \le L\|x - y\|^{\tau - [\tau]} \qquad (4.2)$$

Here $[\tau]$ denotes the largest integer strictly smaller than $\tau$. For any $g \in \mathcal{F}(\tau, L)$, $x = (x_1, \dots, x_d) \in \mathbb{R}^d$, and $l = (l_1, \dots, l_d) \in \mathbb{R}^d$ satisfying $\sum_{m=1}^d l_m^2 = 1$, let $g^{(k,l)}(x)$ denote the $k$-th derivative of function $g$ in direction $l$ at point $x$ whenever it exists. For $\varsigma = 1, \dots, [\tau]$, let $\mathcal{F}_\varsigma(\tau, L)$ denote the class of all elements of $\mathcal{F}(\tau, L)$ such that for any $g \in \mathcal{F}_\varsigma(\tau, L)$ and $l = (l_1, \dots, l_d) \in \mathbb{R}^d$ satisfying $\sum_{m=1}^d l_m^2 = 1$, $g^{(k,l)}(x) = 0$ for all $k = 1, \dots, \varsigma$ whenever $g^{(1,l)}(x) = 0$, and there exist $x = (x_1, \dots, x_d) \in \mathbb{R}^d$ and $l = (l_1, \dots, l_d) \in \mathbb{R}^d$ satisfying $\sum_{m=1}^d l_m^2 = 1$ such that $g^{(\varsigma+1,l)}(x) \ne 0$ and $g^{(1,l)}(x) = 0$. If $\tau \le 1$, I set $\varsigma = 0$ and $\mathcal{F}_\varsigma(\tau, L) = \mathcal{F}(\tau, L)$.

Assumption 3. (i) For some $\tau > 1/4$, $L > 0$, and $\varsigma = 1, \dots, [\tau]$, regression functions $f_m(\cdot) = f_{m,n}(\cdot)$ belong to the class $\mathcal{F}_\varsigma(\tau, L)$ for all $m = 1, \dots, p$. (ii) If $\varsigma < [\tau]$, then for any $x \in \mathbb{R}^d$ and all $\alpha = (\alpha_1, \dots, \alpha_d)$ such that $|\alpha| = \varsigma + 1$, $|D^\alpha f_m(x)| < C$ for some constant $C > 0$ and all $m = 1, \dots, p$.

For simplicity of notation, I assume that all components of $f$ have the same smoothness properties. This assumption is used in the derivation of the power properties of the test. The restriction $\tau > 1/4$ is also needed to make sure that $\hat\Sigma_i$ converges in probability to $\Sigma_i$ uniformly over $i = 1, \dots, n$ at an appropriate rate. I allow regression functions to depend on $n$ to perform a local power analysis.

Assumption 4. The set of bandwidth values has the following form: $H = H_n = \{h = h_{\max} a^k : h \ge h_{\min},\ k = 0, 1, 2, \dots\}$ where $a \in (0, 1)$, $h_{\max} = C$ and $h_{\min} = h_{\min,n} \to 0$ as $n \to \infty$ such that $|H_n| \le C \log n$ for some constant $C > 0$.

According to this assumption, the maximal bandwidth value, $h_{\max}$, is independent of $n$. Its value is chosen to match the radius $C$ of the support of design points. It is intended to detect deviations from the null hypothesis in the form of flat alternatives. The minimal bandwidth value, $h_{\min}$, converges to zero as the sample size increases in such a way that the number of bandwidth values in the set $H_n$ is growing at a logarithmic rate or slower. This assumption will be satisfied if $h_{\min}$ converges to zero at a polynomial rate. The minimal bandwidth value is intended to detect deviations from the null hypothesis in the form of alternatives with peaks.
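The geometric bandwidth set is easy to build explicitly. The following is a minimal sketch in Python, not the author's code; `bandwidth_grid` is a hypothetical helper name, and the numerical values in the usage example are illustrative only.

```python
def bandwidth_grid(h_max: float, h_min: float, a: float) -> list:
    """Return {h_max * a^k : h_max * a^k >= h_min, k = 0, 1, 2, ...} in decreasing order."""
    if not 0.0 < a < 1.0:
        raise ValueError("the scaling parameter a must lie in (0, 1)")
    if not 0.0 < h_min <= h_max:
        raise ValueError("need 0 < h_min <= h_max")
    grid = []
    h = h_max
    while h >= h_min:
        grid.append(h)
        h *= a
    return grid


if __name__ == "__main__":
    grid = bandwidth_grid(h_max=2.0, h_min=0.1, a=0.8)
    print(len(grid), grid[0], grid[-1])
```

The number of grid points is $\lfloor \log(h_{\max}/h_{\min}) / \log(1/a) \rfloor + 1$, so if $h_{\min}$ decays like a power of $n$ the cardinality of the grid grows proportionally to $\log n$, in line with the requirement $|H_n| \le C \log n$.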

Assumption 5. Estimators $\hat\Sigma_i$ of $\Sigma_i$ satisfy $\max_{i=1,\dots,n} \|\hat\Sigma_i - \Sigma_i\|_o = o_p(n^{-\kappa})$ with $\kappa = 1/(2 + d) - \phi$ for arbitrarily small $\phi > 0$ where $\|\cdot\|_o$ denotes the spectral norm on the space of $p \times p$-dimensional symmetric matrices corresponding to the Euclidean norm on $\mathbb{R}^p$.

As follows from Müller and Stadtmüller (1987), under assumptions 2 and 3, assumption 5 is satisfied for the estimators $\hat\Sigma_i$ of $\Sigma_i$ described in section 3.3. In practice, due to the curse of dimensionality, it might be useful to use some parametric or semi-parametric estimators of $\Sigma_i$ instead of the estimator described in section 3.3. For example, if we assume that $\Sigma_i = \Sigma_j$ for all $i, j = 1, \dots, n$, then the estimator of Rice (1984) (or its multivariate generalization) is $\sqrt{n}$-consistent. In this case, assumption 5 will be satisfied with $\kappa = 1/2 - \phi$ for arbitrarily small $\phi > 0$.
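For the homoskedastic scalar case, the difference-based estimator of Rice (1984) can be sketched in a few lines. The code below is a minimal Python illustration, assuming a scalar regressor and a scalar outcome ($d = p = 1$); the function name `rice_variance` and the simulated example are hypothetical and are not taken from the paper.

```python
import random


def rice_variance(x, y):
    """First-difference variance estimator: sum of squared successive differences
    of y (ordered by x), divided by 2 * (n - 1)."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two sequences of equal length with at least 2 points")
    order = sorted(range(len(x)), key=lambda i: x[i])
    ys = [y[i] for i in order]
    n = len(ys)
    return sum((ys[i + 1] - ys[i]) ** 2 for i in range(n - 1)) / (2 * (n - 1))


if __name__ == "__main__":
    rng = random.Random(0)
    xs = [rng.uniform(-2.0, 2.0) for _ in range(2000)]
    # Smooth regression function plus noise with standard deviation 0.1.
    ys = [-abs(v) + 0.1 * rng.gauss(0.0, 1.0) for v in xs]
    print(rice_variance(xs, ys))  # should be close to 0.1 ** 2 = 0.01
```

Differencing neighboring observations removes most of the smooth regression function, so no bandwidth has to be chosen for the variance step.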

Assumption 6. (i) The kernel $K$ is positive and supported on $\{x \in \mathbb{R}^d : \|x\| \le 1\}$. (ii) For some constant $0 < C \le 1$, $K(x) \le 1$ for all $x \in \mathbb{R}^d$ and $K(x) \ge C$ for all $\|x\| \le 1/2$.

I assume that the kernel function is positive on its support. Many kernels satisfy this assumption. For example, one can use rectangular, triangular, parabolic, or biweight kernels. See Tsybakov (2009) for the definitions. On the other hand, the requirement that the kernel is positive on its support excludes higher-order kernels, which are necessary to achieve the minimax optimal testing rate over large classes of smooth alternatives. I require positive kernels because of their negativity-invariance property, which means that any kernel smoother with a positive kernel maps the space of negative functions into itself. This property is essential for obtaining a test with the correct asymptotic size when smoothness properties of moment functions are unknown. With higher-order kernels, one has to assume undersmoothing so that the bias of the estimator is asymptotically negligible in comparison with its standard deviation. Otherwise, large values of $\hat T$ might be caused by large values of the bias term relative to the standard deviation of the estimator even though all components of $f(X)$ are negative. However, for undersmoothing, one has to know the smoothness properties of $f(X)$. In contrast, with positive kernels, the set of bandwidth values can be chosen without reference to these smoothness properties. In particular, the largest bandwidth value can be chosen to be independent of the sample size $n$. Nevertheless, the test developed in this paper will be rate optimal in the minimax sense against class $\mathcal{F}_{[\tau]}(\tau, L)$ when $\tau > d$.
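The negativity-invariance property can be seen in a toy example. The following is a minimal sketch in Python, not part of the paper: it applies a kernel-weighted average with normalized weights to strictly negative values, once with a positive (Epanechnikov) kernel and once with a standard fourth-order kernel that takes negative values near the edge of its support. The three data points and all function names are hypothetical and chosen only to make the sign flip visible.

```python
def kernel_smoother(x0, xs, ys, h, kernel):
    """Kernel-weighted average of ys at x0 with weights K((x - x0) / h) / sum of K."""
    weights = [kernel((x - x0) / h) for x in xs]
    total = sum(weights)
    if total == 0.0:
        raise ValueError("no kernel mass at x0")
    return sum(w * y for w, y in zip(weights, ys)) / total


def epanechnikov(u):
    """Positive second-order kernel on [-1, 1]."""
    return 0.75 * (1.0 - u * u) if abs(u) < 1.0 else 0.0


def fourth_order(u):
    """Fourth-order kernel on [-1, 1]; it is negative for sqrt(3/7) < |u| < 1."""
    return (15.0 / 32.0) * (3.0 - 10.0 * u * u + 7.0 * u**4) if abs(u) < 1.0 else 0.0


if __name__ == "__main__":
    xs = [-0.8, 0.0, 0.8]
    ys = [-1.0, -0.01, -1.0]  # every observed value is negative
    print(kernel_smoother(0.0, xs, ys, 1.0, epanechnikov))  # stays negative
    print(kernel_smoother(0.0, xs, ys, 1.0, fourth_order))  # becomes positive
```

With nonnegative weights the smoothed value is a convex combination of negative numbers and therefore negative; the negative lobes of the higher-order kernel break this, which is the mechanism behind the size distortion described above.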

Assumption 7. (i) For some constant C > 0, f3 = j3 n < C . (ii) (log n) 4 / h^- m n) — > 
as n oo. 

Assumption 7 establishes the trade-off between choosing a small value of $\beta$ and a small value of $h_{\min}$. It is a key condition used to establish an invariance principle that shows that the asymptotic distribution of $\hat T$ depends on the distribution of disturbances $\{\varepsilon_i : i = 1, \dots, n\}$ only through their covariances $\{\Sigma_i : i = 1, \dots, n\}$. Under somewhat stronger conditions, corollary 1 shows that I can set $\beta = 0$, which corresponds to the deterministic version of the test. Note that from assumption 7(ii), it follows that $h_{\min}$ converges to zero at a polynomial rate, which is consistent with assumption 4.

Assumption 8. (i) For every $h \in H_n$, the set of test points $I_h = I_{h,n}$ is such that $\|X_i - X_j\| > 2h$ for all $i, j \in I_{h,n}$ with $i \ne j$ and for each $i = 1, \dots, n$, there exists an element $j(i) \in I_{h,n}$ such that $\|X_i - X_{j(i)}\| \le 2h$. (ii) $S = S_n = \{(i, m, h) : h \in H_n,\ i \in I_{h,n},\ m = 1, \dots, p\}$.
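A set of test points with the two properties in part (i) can be obtained by a single greedy pass over the design. The following is a minimal Python sketch under that reading; it is one possible construction, not necessarily the one used by the author, and `select_test_points` is a hypothetical name.

```python
import math


def select_test_points(points, h):
    """Greedy 2h-separated subset of the design points.

    A point is added when it is farther than 2h from every point selected so far.
    Every point that is skipped is therefore within 2h of some selected point,
    which gives the covering property.
    """
    selected = []
    for i, p in enumerate(points):
        if all(math.dist(p, points[j]) > 2 * h for j in selected):
            selected.append(i)
    return selected


if __name__ == "__main__":
    design = [(0.0,), (0.5,), (1.0,), (1.5,), (2.5,)]
    print(select_test_points(design, h=0.4))
```

Because the kernel is supported on the unit ball, kernel estimators centered at points more than $2h$ apart use disjoint sets of observations at bandwidth $h$.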

Denote the class of models satisfying assumptions 2 and 3 for some fixed values of all constants by $\mathcal{G}$. Each element $w \in \mathcal{G}$ consists of a pair $(f_w, \varepsilon_w)$, where $f_w$ denotes the regression function and $\varepsilon_w$ denotes all the information about the distribution of disturbances in model $w$. Denote the subset of models satisfying $f \le 0$ almost surely by $\mathcal{G}_0$.

4.2 Size Properties of the Test 

Analysis of size properties of the test is complicated because the asymptotic distribution of the test statistic is unknown. Instead, I use a finite sample approach based on the Lindeberg method. For each sample size $n$, this method gives an upper error bound on approximating the expectation of smooth functionals of the test statistic by its expectation calculated assuming Gaussian noise $\{\varepsilon_i\}_{i=1}^n$. I also derive a simple lower bound on the growth rate of the pdf of the test statistic to show that the expectation of smooth functionals can be used to approximate the expectation of indicator functions. Combining these results leads to the approximation of the cdf of the test statistic by its cdf calculated assuming Gaussian disturbances with an explicit error bound. This allows me to derive certain conditions which ensure that the error converges to zero as the sample size $n$ increases, which is a key step in establishing the bootstrap validity.

The first theorem states that the test has correct asymptotic size uniformly over the class of models $\mathcal{G}_0$ both for plug-in and RMS test functions. In addition, the test with the plug-in test function is nonconservative as the size of the test converges to the required level $\alpha$ uniformly over the class of models $\mathcal{G}_0$ with $f_w = 0_p$. When I set $\gamma = \gamma_n \to 0$, the same holds for the test with the RMS test function.

Theorem 1. Let assumptions 1-8 hold. Then for $P = PIA$ or $RMS$,

$$\inf_{w \in \mathcal{G}_0} E_w[g^P_{1-\alpha}(\hat T)] \ge 1 - \alpha + o(1) \qquad (4.3)$$

In addition,

$$\sup_{w \in \mathcal{G}_0,\ f_w = 0_p} E_w[g^{PIA}_{1-\alpha}(\hat T)] = 1 - \alpha + o(1) \qquad (4.4)$$

and if $\gamma_n \to 0$, then

$$\sup_{w \in \mathcal{G}_0,\ f_w = 0_p} E_w[g^{RMS}_{1-\alpha}(\hat T)] = 1 - \alpha + o(1) \qquad (4.5)$$

as well.

Proofs of all results are presented in the Appendix. From the proof of theorem 1, I also have

Corollary 1. If instead of^ii) we assume (log n) 19 /(fe^ in n) — > 0, then theorem 1 
holds with (3 = n = 0. 

The case $\beta = 0$ corresponds to the deterministic version of the test, which rejects the null if and only if $\hat T > c^P_{1-\alpha}$ for $P = PIA$ or $RMS$. However, I can guarantee that this test maintains the required size only if $h_{\min}$ converges to zero very slowly since $(\log n)^{19}$ is a very large number for reasonable sample sizes.

4.3 Consistency Against a Fixed Alternative

Let me introduce a distance between model $w \in \mathcal{G}$ and the null hypothesis:

$$\rho(w, H_0) = \sup_{i = 1, \dots, \infty;\ m = 1, \dots, p} [f^w_m(X_i)]_+ \qquad (4.6)$$

For any alternative outside of the set $\Theta_I$, $\rho(w, H_0) > 0$. In this section, I state the result that the test is consistent against any fixed alternative $w$ with $\rho(w, H_0) > 0$ satisfying assumptions 1-8. Moreover, I show that the test is consistent uniformly against alternatives whose distance from the null hypothesis is bounded away from zero. For $\rho > 0$, let $\mathcal{G}_\rho$ denote the subset of all elements of $\mathcal{G}$ such that $\rho(w, H_0) \ge \rho$ for all $w \in \mathcal{G}_\rho$. Then

Theorem 2. Let assumptions 1-8 hold. Then for $P = PIA$ or $RMS$,

$$\sup_{w \in \mathcal{G}_\rho} E_w[g^P_{1-\alpha}(\hat T)] \to 0 \qquad (4.7)$$

as $n \to \infty$.

4.4 Consistency Against One-Directional Alternatives 

Let $w(0) \in \mathcal{G}$ be such that $\rho(w(0), H_0) > 0$. For some sequence $\{a_n\}_{n=1}^{\infty}$ of positive numbers converging to zero, let $f_n = a_n f_{w(0)}$ be a sequence of local alternatives. I refer to such sequences as local one-directional alternatives. This section establishes the consistency of the test against such alternatives whenever $\sqrt{n/\log n}\, a_n \to \infty$.

Theorem 3. Let assumptions 1-8 hold. Then for $P = PIA$ or $RMS$,

$$\sup_{w \in \mathcal{G},\ f_w = f_n} E_w[g^P_{1-\alpha}(\hat T)] \to 0 \qquad (4.8)$$

as $n \to \infty$ if $\sqrt{n/\log n}\, a_n \to \infty$.

Remark. Recall the CMI model from the first example mentioned in the introduction where $m(X, W, \theta) = \theta \tilde m(X, W)$ and $E[\tilde m(X, W)|X] > 0$ almost surely. The theorem above shows that the test developed in this paper is consistent against sequences of alternatives $\theta = \theta_{0,n}$ whenever $\sqrt{n/\log n}\, \theta_{0,n} \to \infty$ in this model. So, my test is consistent against virtually the same set of alternatives in this model as the test of Andrews and Shi (2010).



4.5 Uniform Consistency Against Holder Smoothness Classes 

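In this section, I present the rate of uniform consistency of the test against the class $\mathcal{F}_\varsigma(\tau, L)$ under certain additional constraints. These additional constraints are needed to deal with some boundary effects. Let $S = \mathrm{cl}\{X_i : i \in \mathbb{N}\}$ denote the closure of the infinite set of design points. For any $\vartheta > 0$, let $S_\vartheta$ be the subset of $S$ such that for any $x \in S_\vartheta$, the ball with center at $x$ and radius $\vartheta$, $B_\vartheta(x)$, is contained in $S$, i.e. $B_\vartheta(x) \subset S$. Denote $\zeta = \min(\varsigma + 1, \tau)$. When $\zeta \le d$, set $\vartheta = \vartheta_n = 4\sqrt{d}\, h_{\min}$. When $\zeta > d$, set $\vartheta = \vartheta_n = 4\sqrt{d}\,(\log n/n)^{1/(2\zeta+d)}$. Let $\mathbb{N}_{\vartheta_n} = \{i \in \mathbb{N} : X_i \in S_{\vartheta_n}\}$. For any $w \in \mathcal{G}$, let

$$\rho_{\vartheta_n}(w, H_0) = \sup_{i \in \mathbb{N}_{\vartheta_n},\ m = 1, \dots, p} [f^w_m(X_i)]_+ \qquad (4.9)$$

denote the distance between $w$ and $H_0$ over the set $S_{\vartheta_n}$. For the next theorem, I will use the $\rho_{\vartheta_n}$-metric (instead of the $\rho$-metric) to measure the distance between alternatives and the null hypothesis. Such restrictions are quite common in the literature. See, for example, Dümbgen and Spokoiny (2001) and Lee et al. (2011). Let $\mathcal{G}_\vartheta$ be the subset of all elements of $\mathcal{G}$ such that $\inf_{w \in \mathcal{G}_\vartheta} \rho_{\vartheta_n}(w, H_0) \ge C h_{\min}^{\zeta}$ for some large constant $C$ if $\zeta \le d$ and $\inf_{w \in \mathcal{G}_\vartheta} \rho_{\vartheta_n}(w, H_0)(n/\log n)^{\zeta/(2\zeta+d)} \to \infty$ if $\zeta > d$. Then

Theorem 4. Let assumptions 1-8 hold. For $P = PIA$ or $RMS$, if (i) $\zeta \le d$ or (ii) $\zeta > d$ and $h_{\min} < (\log n/n)^{1/(2\zeta+d)}$ for large enough $n$, then

$$\sup_{w \in \mathcal{G}_\vartheta} E_w[g^P_{1-\alpha}(\hat T)] \to 0 \qquad (4.10)$$

as $n \to \infty$.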



Remark. Recall the CMI model from the second example mentioned in the introduction where $m(X, W, \theta) = \tilde m(X, W) + \theta$. Assume that $X \in \mathbb{R}$ and $E[\tilde m(X, W)|X] = -|X|^{\nu}$ with $\nu > 1$. In this model, the identified set is $\Theta_I = \{\theta \in \mathbb{R} : \theta \le 0\}$. The theorem above shows that the test developed in this paper is consistent against sequences of alternatives $\theta_n = \theta_{0,n}$ whenever $(n/\log n)^{\nu/(2\nu+1)} \theta_{0,n} \to \infty$. At the same time, it follows from Armstrong (2011a) that the test of Andrews and Shi (2010) is consistent only if $n^{\nu/(2(\nu+1))} \theta_{0,n} \to \infty$, so their test has a slower rate of consistency than that developed in this paper.
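The comparison in the remark comes down to two exponents. The following minimal Python sketch simply evaluates them; it assumes the rates as read above, namely $(n/\log n)^{\nu/(2\nu+d)}$ with $d = 1$ for the test of this paper and $n^{\nu/(2(\nu+1))}$ for the test of Andrews and Shi (2010). The function names are hypothetical.

```python
def adaptive_exponent(nu: float, d: int = 1) -> float:
    """Exponent nu / (2 * nu + d) of n / log(n) in the rate of the adaptive test."""
    return nu / (2.0 * nu + d)


def ks_exponent(nu: float) -> float:
    """Exponent nu / (2 * (nu + 1)) of n in the rate attributed to Andrews and Shi (2010)."""
    return nu / (2.0 * (nu + 1.0))


if __name__ == "__main__":
    for nu in (1.5, 2.0, 4.0):
        print(nu, adaptive_exponent(nu), ks_exponent(nu))
```

For $d = 1$ the difference between the two exponents is $\nu / ((2\nu+1)(2\nu+2)) > 0$, so the polynomial gain eventually dominates the extra logarithmic factor, although in moderate samples the logarithm can still matter.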

4.6 Lower Bound on the Minimax Rate of Testing 

In this section, I give a lower bound on the minimax rate of testing. For $S_\vartheta$ defined in the previous section, let $N(h, S_{\vartheta_n})$ be the largest $m$ such that there exists $\{x_1, \dots, x_m\} \subset S_{\vartheta_n}$ with $\|x_i - x_j\| \ge h$ for all $i, j = 1, \dots, m$ if $i \ne j$. I will assume that $N(h, S_{\vartheta_n}) \ge C h^{-d}$ for all $h \in (0, 1)$ and large enough $n$ for some constant $C > 0$. This condition holds almost surely under the conditions of lemma 4. Let $\phi_n(Y_1, \dots, Y_n)$ denote a sequence of tests, i.e. $\phi_n(Y_1, \dots, Y_n)$ equals the probability of rejecting the null hypothesis upon observing sample $Y = (Y_1, \dots, Y_n)$.

Theorem 5. Let assumptions 1-8 hold. Assume that (i) $N(h, S_{\vartheta_n}) \ge C h^{-d}$ for all $h \in (0, 1)$ and large enough $n$ for some constant $C > 0$, (ii) $\varsigma = [\tau]$, and (iii) $r_n (n/\log n)^{\tau/(2\tau+d)} \to 0$ as $n \to \infty$ for some sequence of positive numbers $r_n$. Then for any sequence of tests $\phi_n(Y_1, \dots, Y_n)$ with $\sup_{w \in \mathcal{G}_0} E_w[\phi_n(Y_1, \dots, Y_n)] \le \alpha$,

$$\limsup_{n \to \infty} \inf E_w[\phi_n(Y_1, \dots, Y_n)] \le \alpha \qquad (4.11)$$

Since $\mathcal{F}_{[\tau]}(\tau, L) \subset \mathcal{F}(\tau, L)$, the same lower bound applies for the class $\mathcal{F}(\tau, L)$ as well. Comparing this result with theorem 4 shows that the test presented in this paper is minimax rate optimal if $\zeta = \tau > d$ and $h_{\min}$ is chosen to converge to zero fast enough. When $\zeta = \tau = d$ and $\beta_n$ is set to be constant, the test is rate optimal up to some logarithmic factors if $h_{\min}$ is chosen to converge to zero as fast as possible satisfying assumption 7. When $\tau < d$, the test is not rate optimal since the rate of consistency does not match the lower bound.

5 Models with Infinitely Many CMI 

In this section, I briefly outline an extension of the test to the case of infinitely many CMI. Suppose that the parameter $\theta$ is restricted by a countably infinite number of CMI, i.e. $p = \infty$. As before, I am interested in testing the null hypothesis, $H_0$, that $\theta = \theta_0$ against the alternative, $H_a$, that $\theta \ne \theta_0$. One possible approach to testing in this model is to construct a test as described in section 3 based on some finite subset of CMI assuming that as the sample size $n$ increases, this subset expands covering all CMI in the asymptotics. The advantage of the finite sample approach used in this paper is that it immediately gives certain conditions that ensure that such a test maintains the required size asymptotically. Assume that the test is based on $K = K_n \to \infty$ inequalities. Then

Corollary 2. Let assumptions [0(7} an d hold. In addition, assume that (i) 
max t=i,...,n — Sj|| = o p {n~ K ) for some k > 0, (ii) K n \ogn/n K ^ 4 — > 0, (Hi) 
P = 0n < C, and (iv) f^(logn) 4 / '{P^h^n) -»■ as n -»■ oo. Then for P = PI A or 
RMS, 

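$$\inf_{w \in \mathcal{G}_0} E_w[g^P_{1-\alpha}(\hat T)] \ge 1 - \alpha + o(1) \qquad (5.1)$$

as $n \to \infty$. In addition,

$$E_w[g^P_{1-\alpha}(\hat T)] \to 0 \qquad (5.2)$$

for any $w \in \mathcal{G}_\rho$ with $\rho > 0$.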

This corollary shows that the randomized test has correct asymptotic size both with plug-in and RMS critical values and is consistent against fixed alternatives outside of the set $\Theta_I$. Note that $\kappa$ appearing in condition (i) in this corollary will generally be different from $\kappa$ used in assumption 5 because of the increasing number of moment functions. Results concerning the test with deterministic critical values and local power of the test, with suitable modifications, can also be easily obtained using arguments similar to those used in the proofs of corollary 1 and theorems 3 and 4. For brevity, I do not discuss these results.



6 Monte Carlo Results 



In this section, I present results of Monte Carlo simulations. The aim of these simulations is twofold. First, I demonstrate that my test maintains size in finite samples reasonably well. Second, I compare relative advantages and disadvantages of my test and the tests of Andrews and Shi (2010), Chernozhukov et al. (2009), and Lee et al. (2011). The methods of Andrews and Shi (2010) and Lee et al. (2011) are most appropriate for detecting flat alternatives, which represent one-directional local alternatives. These methods have low power against alternatives with peaks, however. The test of Chernozhukov et al. (2009) has higher power against such alternatives, but it requires knowing smoothness properties of the moment functions. The authors suggest certain rule-of-thumb techniques to choose a bandwidth value. Finally, the main advantage of my test is its adaptiveness. In comparison with Andrews and Shi (2010) and Lee et al. (2011), my test has higher power against alternatives with peaks. In comparison with Chernozhukov et al. (2009), my test has higher power when their rule-of-thumb techniques lead to an inappropriate bandwidth value. For example, this happens when the underlying regression function is mostly flat but varies significantly in the region where the null hypothesis is violated (the case of spatially inhomogeneous alternatives, see Lepski and Spokoiny (1999)).



The data generating process in the experiments is

$$Y = L(M - |X|)_+ - m + \varepsilon \qquad (6.1)$$

where $X$, $Y$, and $\varepsilon$ are scalar random variables and $L$, $M$, and $m$ are some constants. $X$ is distributed uniformly on $(-2, 2)$. Depending on the experiment, $\varepsilon$ is distributed according to $0.1 \cdot N(0, 1)$ or $(\xi \cdot 0.07 + (1 - \xi) \cdot 0.18) \cdot N(0, 1)$ where $\xi$ is a Bernoulli random variable with $p(\xi = 1) = 0.8$ and $p(\xi = 0) = 0.2$ independent of $N(0, 1)$. In both cases, $\varepsilon$ is independent of $X$. I consider the following specifications for parameters. Case 1: $L = M = m = 0$. Case 2: $L = 0.1$, $M = 0.2$, $m = 0.02$. Case 3: $L = M = 0$, $m = -0.02$. Case 4: $L = 2$, $M = 0.2$, $m = 0.2$. Note that $E[Y|X] \le 0$ almost surely in cases 1 and 2 while $P\{E[Y|X] > 0\} > 0$ in cases 3 and 4. In case 3, the alternative is flat. In case 4, the alternative has a peak in the region where the null hypothesis is violated. I have chosen parameters so that rejection probabilities are strictly greater than 0 and strictly smaller than 1 in most cases so that meaningful comparisons are possible. I generate samples $(X_i, Y_i)_{i=1}^n$ of size $n = 250$ and $500$ from the distribution of $(X, Y)$. In all cases, I consider tests with the nominal size 10%. The results are based on 1000 simulations for each specification.
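The design above is simple to simulate. The following is a minimal Python sketch of the data generating process, assuming the reading $Y = L(M - |X|)_+ - m + \varepsilon$ of (6.1); the names `CASES`, `regression`, and `draw_sample` are hypothetical, and the random number generator is not the one used for the reported tables.

```python
import random

# (L, M, m) for the four parameter specifications described in the text.
CASES = {
    1: (0.0, 0.0, 0.0),
    2: (0.1, 0.2, 0.02),
    3: (0.0, 0.0, -0.02),
    4: (2.0, 0.2, 0.2),
}


def regression(x, L, M, m):
    """Conditional mean E[Y | X = x] = L * (M - |x|)_+ - m."""
    return L * max(M - abs(x), 0.0) - m


def draw_sample(n, case, mixture, rng):
    """Draw n pairs (X_i, Y_i); `mixture` selects the scale-mixture disturbance."""
    L, M, m = CASES[case]
    sample = []
    for _ in range(n):
        x = rng.uniform(-2.0, 2.0)
        if mixture:
            scale = 0.07 if rng.random() < 0.8 else 0.18  # xi = 1 with probability 0.8
        else:
            scale = 0.1
        y = regression(x, L, M, m) + scale * rng.gauss(0.0, 1.0)
        sample.append((x, y))
    return sample


if __name__ == "__main__":
    data = draw_sample(250, case=4, mixture=True, rng=random.Random(0))
    print(len(data), data[0])
```

The mixture disturbance has variance $0.8 \cdot 0.07^2 + 0.2 \cdot 0.18^2 = 0.0104$, close to the variance $0.01$ of the normal disturbance, so the two designs differ mainly in the shape of the noise distribution rather than its scale.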



For the test of Andrews and Shi (2010), I consider their Kolmogorov-Smirnov test statistic with boxes and truncation parameter 0.05. I simulate both plugin (AS, plugin) and GMS (AS, GMS) critical values based on the bootstrap suggested in their paper. I use the support of the empirical distribution of $X$ to choose a set of weighting functions. All other tuning parameters are set as prescribed in their paper.

Implementing all other tests requires selecting a kernel function. In all cases, I use the following kernel function

$$K(x) = 1.5(1 - 4x^2)_+ \qquad (6.2)$$

For the test of Chernozhukov et al. (2009), I use their kernel type test statistic with critical values based on the multiplier bootstrap both with (CLR, $\hat V$) and without (CLR, $V$) the set estimation. Both Chernozhukov et al. (2009) and Lee et al. (2011) (LSW) circumvent edge effects of kernel estimators by restricting their test statistics to the proper subsets of the support of $X$. So, I select 10 and 90% quantiles of the empirical distribution of $X$ as bounds for the set over which the test statistics are calculated. Both tests are nonadaptive. In particular, there is no formal theory on how to choose bandwidth values in their tests. I use their suggestions to choose bandwidth values. For the test of Lee et al. (2011), I use their test statistic based on the one-sided $L_1$-norm.
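As a quick sanity check on the kernel in (6.2), the sketch below integrates it numerically. It is a minimal Python illustration assuming the reading $K(x) = 1.5(1 - 4x^2)_+$, so that the kernel is a rescaled parabolic kernel supported on $[-1/2, 1/2]$; the helper names are hypothetical.

```python
def mc_kernel(x: float) -> float:
    """Kernel from (6.2) as read here: 1.5 * (1 - 4 x^2) on [-1/2, 1/2], zero elsewhere."""
    return 1.5 * max(1.0 - 4.0 * x * x, 0.0)


def midpoint_integral(f, a: float, b: float, steps: int = 20000) -> float:
    """Midpoint-rule approximation of the integral of f over [a, b]."""
    width = (b - a) / steps
    return width * sum(f(a + (k + 0.5) * width) for k in range(steps))


if __name__ == "__main__":
    print(midpoint_integral(mc_kernel, -1.0, 1.0))  # numerically close to 1
```

Under this reading the kernel integrates to one, since the integral of $1 - 4x^2$ over $[-1/2, 1/2]$ equals $2/3$.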

Let me now describe the choice of parameters for the test developed in this paper. The largest bandwidth value, $h_{\max}$, is set to be one half of the length of the support of the empirical distribution. I choose the smallest bandwidth value, $h_{\min}$, so that the kernel estimator uses on average 15 data points when $n = 250$ and 20 data points when $n = 500$. The scaling parameter, $a$, equals 0.8 so that the set of bandwidth values is

$$H_n = \{h = h_{\max} 0.8^k : h \ge h_{\min},\ k = 0, 1, 2, \dots\} \qquad (6.3)$$

My test requires choosing the set $S_n$. For each bandwidth value, $h$, I select the largest subset, $S_{n,h}$, of $X_i$'s such that $|X_i - X_j| > h$ for any nonequal elements in $S_{n,h}$, and the smallest $X_i$ is always in $S_{n,h}$. Then $S_n = \{(i, h) : h \in H_n,\ X_i \in S_{n,h}\}$. In all cases, I set $\beta = 0$ so that the deterministic version of the critical values is used. Finally, for the RMS critical value, I set $\gamma = 0.1/\log(n)$ to make meaningful comparisons with the test of Chernozhukov et al. (2009). In all bootstrap procedures, for all tests, I use 1000 repetitions when $n = 250$ and 500 repetitions when $n = 500$.

The results of the experiments are presented in table 1 for n = 250 and in table 2 for n = 500. In both tables, my test is denoted as Adaptive test with plug-in and RMS critical values.

Table 1: Results of Monte Carlo Experiments, n = 250
(entries are probabilities of rejecting the null hypothesis)

Distribution ε   Case   AS, plugin   AS, GMS   LSW     CLR, V   CLR, V̂   Adaptive test, plugin   Adaptive test, RMS
Normal           1      0.099        0.102     0.124   0.151    0.151     0.101                   0.101
Normal           2      0.002        0.007     0.000   0.008    0.008     0.009                   0.009
Normal           3      0.910        0.910     0.941   0.808    0.808     0.723                   0.723
Normal           4      0.000        0.143     0.000   0.122    0.191     0.589                   0.821
Mixture          1      0.078        0.086     0.107   0.134    0.134     0.124                   0.124
Mixture          2      0.002        0.002     0.000   0.010    0.010     0.016                   0.016
Mixture          3      0.904        0.905     0.925   0.833    0.833     0.692                   0.692
Mixture          4      0.000        0.121     0.000   0.111    0.197     0.555                   0.808

Consider first results for n = 250. In case 1, where the null
hypothesis holds, all tests have rejection probabilities close to the nominal size 10%
both for normal and mixture of normals disturbances. In particular, the RMS procedure
for my test, the GMS procedure for the test of Andrews and Shi (2010), and the test of
Chernozhukov et al. (2009) with the set estimation do not overreject, which might
have been a concern based on the construction of these tests. In case 2, where the null
hypothesis holds but the underlying regression function is mainly strictly below the
borderline, all tests are conservative. When the null hypothesis is violated with a
flat alternative (case 3), the tests of Andrews and Shi (2010) and Lee et al. (2011)
have highest rejection probabilities as expected from the theory. In this case, my
test is less powerful in comparison with these tests and somewhat similar to the
method of Chernozhukov et al. (2009). This is compensated in case 4 where the null
hypothesis is violated with the peak-shaped alternative. In this case, the power of
my test is much higher than that of competing tests. This is especially true for my
test with RMS critical values whose rejection probability exceeds 80% while rejection
probabilities of competing tests do not exceed 20%. Note that all results are stable
across distributions of disturbances. Also note that my test with RMS critical values
has much higher power than the test with plugin critical values in case 4. So, among
these two tests, I recommend the test with RMS critical values. Results for n = 500
indicate a similar pattern. Concluding this section, I note that all simulation results
are consistent with the presented theory.

Table 2: Results of Monte Carlo Experiments, n = 500
(entries are probabilities of rejecting the null hypothesis)

Distribution ε   Case   AS, plugin   AS, GMS   LSW     CLR, V   CLR, V̂   Adaptive test, plugin   Adaptive test, RMS
Normal           1      0.095        0.104     0.119   0.126    0.126     0.103                   0.103
Normal           2      0.000        0.001     0.000   0.002    0.002     0.008                   0.008
Normal           3      0.997        0.997     0.996   0.954    0.954     0.903                   0.903
Normal           4      0.008        0.587     0.000   0.497    0.694     0.976                   0.999
Mixture          1      0.120        0.123     0.130   0.117    0.117     0.119                   0.119
Mixture          2      0.000        0.001     0.000   0.000    0.000     0.010                   0.010
Mixture          3      0.993        0.993     0.996   0.949    0.949     0.903                   0.903
Mixture          4      0.005        0.549     0.000   0.456    0.625     0.978                   0.997


7 Conclusions 



In this paper, I developed a new test of conditional moment inequalities. In contrast 
to some other tests in the literature, my test is directed against general nonparamet- 
ric alternatives, which gives high power in a large class of CMI models. Considering 
kernel estimates of moment functions with many different values of the bandwidth 
parameter allows me to construct a test that automatically adapts to the unknown 
smoothness of moment functions and selects the most appropriate testing bandwidth 
value. The test developed in this paper has uniformly correct asymptotic size, no 
matter whether the model is identified, weakly identified, or not identified, and is 
uniformly consistent against certain, but not all, large classes of smooth alternatives 
whose distance from the null hypothesis converges to zero at a fastest possible rate.
The tests of Andrews and Shi (2010) and Lee et al. (2011) have nontrivial power
against $n^{-1/2}$-local one-directional alternatives whereas my method only allows for
nontrivial testing against $(n/\log n)^{-1/2}$-local alternatives of this type. The additional
$(\log n)^{1/2}$ factor should be regarded as a price for having a fast rate of uniform consistency.
There exist sequences of local alternatives against which their tests are not
consistent whereas mine is. Monte Carlo experiments give an example of a CMI model
where finite sample power of my test greatly exceeds that of competing tests. 
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A Appendix 



This Appendix contains proofs of all results stated in the main part of the paper. 
Section A.1 explains the equivalent representations for the randomized test. Section
A.2 derives a bound on the modulus of continuity in the operator norm of the square
root operator on the space of symmetric positive semidefinite matrices. Section A.3
gives a straightforward generalization of the results in Chatterjee (2005) to the case of
multidimensional random variables. These results give conditions under which the
distribution of a function of several independent random variables with unknown
distributions can be approximated by substituting Gaussian distributions with the
same first two moments. They are based on Lindeberg's argument. The result is
specialized to the situation when the function of interest can be written in the form
of the maximum of linear functions of the data. These results have their own value
as they can be used as an alternative to results on stochastic approximation from
empirical process theory. They are also useful because they give an explicit bound on
the approximation error. Section A.4 gives sufficient conditions for assumption 1 in
the main part of the paper. Section A.5 presents an anticoncentration inequality for
the maximum of Gaussian random variables with unit variance. Section A.6 describes
a result on Gaussian random variables which is used in the proof of the lower bound on
the minimax rate. Section A.7 develops some preliminary technical results necessary
for the proofs of the main theorems. Finally, section A.8 presents the proofs of the
theorems stated in the main part of the paper.

Note that all convergence results proven in this Appendix hold uniformly over the
class of models Q. This fact will not be stated separately in each special case, but it
is assumed everywhere in this Appendix.



A.1 Lemma on the equivalent representation of the test

The lemma below was used in section 3.2 to show that the randomized test is
equivalent to the test with the random critical value.

Lemma 1. E[g(T)\ = P{f < 
Proof. Since $g(\hat T) \in [0, 1]$ almost surely,

$$E[g(\hat T)] = \int_0^1 P\{g(\hat T) > x\}\,dx \qquad \text{(A.1)}$$

Given that $U$ is independent of the data and, hence, of $\hat T$,

$$\int_0^1 P\{g(\hat T) > x\}\,dx = P\{g(\hat T) > U\} \qquad \text{(A.2)}$$

Finally, note that {g(T) > U} is equivalent to {T < so that 

P{g(f) >U} = P{f < t^ a } (A.3) 
Combining (A.1), (A.2), and (A.3) gives the result. □



A.2 Continuity of the square root operator on the set of 
positive semidefinite matrices 

Lemma 2. Let $A$ and $B$ be $p \times p$-dimensional symmetric positive semidefinite
matrices. Then $\|A^{1/2} - B^{1/2}\| \le p^{1/2}\|A - B\|^{1/2}$ where $\|\cdot\|$ means the spectral norm
corresponding to the Euclidean norm on $\mathbb{R}^p$.

Proof. Let $a_1,...,a_p$ and $b_1,...,b_p$ be orthogonal eigenvectors of matrices $A$ and $B$
correspondingly. Without loss of generality, I can and will assume that $\|a_i\| = \|b_i\| = 1$
for all $i = 1,...,p$ where $\|\cdot\|$ denotes the Euclidean norm on $\mathbb{R}^p$. Let $\lambda_1(A),...,\lambda_p(A)$
and $\lambda_1(B),...,\lambda_p(B)$ be the corresponding eigenvalues. Let $f_{i1},...,f_{ip}$ be the coordinates of $a_i$
in the basis $(b_1,...,b_p)$ for all $i = 1,...,p$. Then $\sum_{j=1}^p f_{ij}^2 = 1$ for all $i = 1,...,p$.
For any $i = 1,...,p$,

$$\sum_{j=1}^p (\lambda_i(A) - \lambda_j(B))^2 f_{ij}^2 = \Big\|\sum_{j=1}^p (\lambda_i(A) - \lambda_j(B)) f_{ij} b_j\Big\|^2 = \|(A - B)a_i\|^2 \le \|A - B\|^2$$

since $\|(A - B)a_i\| \le \|A - B\|\,\|a_i\| = \|A - B\|$.

For $P = A, B$, $P^{1/2}$ has the same eigenvectors as $P$ with corresponding eigenvalues
equal to $\lambda_1^{1/2}(P),...,\lambda_p^{1/2}(P)$. Therefore, for any $i = 1,...,p$,

$$\|(A^{1/2} - B^{1/2})a_i\|^2 = \sum_{j=1}^p (\lambda_i^{1/2}(A) - \lambda_j^{1/2}(B))^2 f_{ij}^2 \le \sum_{j=1}^p |\lambda_i(A) - \lambda_j(B)|\, f_{ij}^2 \le \Big(\sum_{j=1}^p (\lambda_i(A) - \lambda_j(B))^2 f_{ij}^2\Big)^{1/2} \le \|A - B\|$$

where the last line used the inequality derived above. For any $c \in \mathbb{R}^p$ with $\|c\| = 1$,
let $d_1,...,d_p$ be the coordinates of $c$ in the basis $(a_1,...,a_p)$. Then

$$\|(A^{1/2} - B^{1/2})c\| = \Big\|\sum_{i=1}^p d_i (A^{1/2} - B^{1/2}) a_i\Big\| \le \sum_{i=1}^p |d_i|\,\|(A^{1/2} - B^{1/2})a_i\| \le \sum_{i=1}^p |d_i|\,\|A - B\|^{1/2} \le p^{1/2}\|A - B\|^{1/2}$$

since $\sum_{i=1}^p d_i^2 = 1$. Thus, $\|A^{1/2} - B^{1/2}\| \le p^{1/2}\|A - B\|^{1/2}$. □
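The bound in Lemma 2 can be checked numerically. The following is a minimal sketch, assuming NumPy is available; the helper names (`psd_sqrt`, `lemma2_gap`, `random_psd`) are illustrative and not part of the paper, and the random matrices are just one convenient way to generate positive semidefinite inputs.

```python
import numpy as np

def psd_sqrt(a):
    """Symmetric PSD square root via the eigendecomposition of a."""
    vals, vecs = np.linalg.eigh(a)
    vals = np.clip(vals, 0.0, None)  # guard against tiny negative round-off
    return (vecs * np.sqrt(vals)) @ vecs.T

def lemma2_gap(a, b):
    """Return (lhs, rhs) of ||A^{1/2} - B^{1/2}|| <= p^{1/2} ||A - B||^{1/2}."""
    p = a.shape[0]
    lhs = np.linalg.norm(psd_sqrt(a) - psd_sqrt(b), 2)   # spectral norm
    rhs = np.sqrt(p) * np.sqrt(np.linalg.norm(a - b, 2))
    return lhs, rhs

def random_psd(rng, p):
    """Hypothetical test input: M M^T is always positive semidefinite."""
    m = rng.standard_normal((p, p))
    return m @ m.T

rng = np.random.default_rng(0)
pairs = [lemma2_gap(random_psd(rng, 4), random_psd(rng, 4)) for _ in range(200)]
```

Each pair in `pairs` holds the left and right sides of the inequality for one random draw; a simulation of this kind only illustrates the lemma and is no substitute for the proof.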



A.3 Invariance principle

In this section, I generalize the results of Chatterjee (2005) to the case of random vectors
($p > 1$). I also specialize the results to the case of linear functions because this allows me to
greatly improve some constants in Chatterjee's derivation. Let $Z_1,...,Z_n$ be a sequence
of independent $p$-dimensional random vectors with $E[Z_j] = 0$ for all $j = 1,...,n$.
Denote $Z = (Z_1,...,Z_n)$. For each $k = 1,...,K$ and $m = 1,...,p$, let
$f_{km}(Z) = \sum_{j=1}^n a_{kjm} Z_{j,m}$ be some linear function of $Z$ where $a_{kjm} \ge 0$ for each $k = 1,...,K$,
$j = 1,...,n$, and $m = 1,...,p$, and $Z_{j,m}$ denotes the $m$-th component of vector $Z_j$. Let
$U_1,...,U_n$ be a sequence of independent normal $p$-dimensional random vectors such
that $E[U_j] = 0$ and $E[Z_j Z_j^T] = E[U_j U_j^T]$ for each $j = 1,...,n$. Denote $U = (U_1,...,U_n)$
and

$$C(g) = \|g'\|_\infty + 3\|g''\|_\infty + \|g'''\|_\infty \qquad \text{(A.4)}$$

Denote $a = \max_{k,j,m} a_{kjm}$. Then

Theorem 6. For any thrice differentiable function $g$ on $\mathbb{R}$,

$$\Big|E[g(\max_{k,m} f_{km}(Z))] - E[g(\max_{k,m} f_{km}(U))]\Big| \le (3/6^{1/3})\, p\, a\, (C(g)n)^{1/3} (\|g'\|_\infty \log(Kp))^{2/3} \Big(\max_{j,m} E[|Z_{j,m}|^3] + \max_{j,m} E[|U_{j,m}|^3]\Big)^{1/3}$$

Remark. The constant in the inequality above can be improved somewhat by using the
expressions for $\Lambda_1$, $\Lambda_2$, and $\Lambda_3$ in the proof given below. I do not follow this step
because it would complicate the statement of the theorem significantly.



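Proof. As in Chatterjee (2005), for $\alpha \ge 1$, let $F_\alpha : \mathbb{R}^{p \times n} \to \mathbb{R}$ be such that

$$F_\alpha(x) = \alpha^{-1} \log\Big(\sum_{k,m} \exp(\alpha f_{km}(x))\Big) \qquad \text{(A.5)}$$

for all $x \in \mathbb{R}^{p \times n}$. Then

$$\max_{k,m} f_{km}(x) = \alpha^{-1} \log(\exp(\alpha \max_{k,m} f_{km}(x))) \le \alpha^{-1} \log\Big(\sum_{k,m} \exp(\alpha f_{km}(x))\Big) \le \alpha^{-1} \log(Kp \exp(\alpha \max_{k,m} f_{km}(x))) \le \alpha^{-1} \log(Kp) + \max_{k,m} f_{km}(x)$$

So,

$$\Big|\max_{k,m} f_{km}(x) - F_\alpha(x)\Big| \le \alpha^{-1} \log(Kp) \qquad \text{(A.6)}$$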
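The smooth-maximum bound (A.6) is easy to illustrate numerically. The sketch below assumes NumPy; `smooth_max` is an illustrative name, and the random vector `f_values` merely stands in for the $Kp$ values $f_{km}(x)$.

```python
import numpy as np

def smooth_max(values, alpha):
    """F_alpha = alpha^{-1} log(sum_i exp(alpha * v_i)), computed stably."""
    v = np.asarray(values, dtype=float)
    top = v.max()
    # Subtracting the maximum before exponentiating avoids overflow.
    return top + np.log(np.exp(alpha * (v - top)).sum()) / alpha

rng = np.random.default_rng(1)
K, p, alpha = 5, 3, 10.0
f_values = rng.standard_normal(K * p)    # stand-ins for f_km(x)
gap = smooth_max(f_values, alpha) - f_values.max()
bound = np.log(K * p) / alpha            # right-hand side of (A.6)
```

Here `gap` is nonnegative and no larger than `bound`, and the bound shrinks as `alpha` grows, which is the trade-off the proof later balances against the growth of the derivatives of $F_\alpha$.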



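Thus,

$$\Big|E[g(\max_{k,m} f_{km}(Z))] - E[g(\max_{k,m} f_{km}(U))]\Big| \le 2\|g'\|_\infty \alpha^{-1} \log(Kp) + |E[g(F_\alpha(Z))] - E[g(F_\alpha(U))]|$$

For any $j = 0,...,n$, denote $Z^j = (Z_1,...,Z_j,U_{j+1},...,U_n)$. Then

$$|E[g(F_\alpha(Z))] - E[g(F_\alpha(U))]| \le \sum_{j=1}^n |E[g(F_\alpha(Z^j))] - E[g(F_\alpha(Z^{j-1}))]| \qquad \text{(A.7)}$$

For $Z_1,...,Z_{j-1},U_{j+1},...,U_n$ fixed, denote $l(Z_j) = g(F_\alpha(Z^j))$. By Taylor formula,

$$g(F_\alpha(Z^j)) - g(F_\alpha(Z^{j-1})) = l(Z_j) - l(U_j)$$
$$= \sum_{m} \frac{\partial l(0)}{\partial Z_{jm}} (Z_{jm} - U_{jm}) + (1/2) \sum_{m_1,m_2} \frac{\partial^2 l(0)}{\partial Z_{jm_1} \partial Z_{jm_2}} (Z_{jm_1} Z_{jm_2} - U_{jm_1} U_{jm_2})$$
$$+ (1/6) \sum_{m_1,m_2,m_3} \frac{\partial^3 l(\tilde Z_j)}{\partial Z_{jm_1} \partial Z_{jm_2} \partial Z_{jm_3}} Z_{jm_1} Z_{jm_2} Z_{jm_3} - (1/6) \sum_{m_1,m_2,m_3} \frac{\partial^3 l(\tilde U_j)}{\partial Z_{jm_1} \partial Z_{jm_2} \partial Z_{jm_3}} U_{jm_1} U_{jm_2} U_{jm_3}$$

where $\tilde Z_j$ and $\tilde U_j$ are on the lines connecting $0$ and $Z_j$ and $0$ and $U_j$ correspondingly.
By independence,

$$|E[g(F_\alpha(Z^j))] - E[g(F_\alpha(Z^{j-1}))]| \le (1/6) \sum_{m_1,m_2,m_3} \sup_X \Big|\frac{\partial^3 g(F_\alpha(X))}{\partial X_{jm_1} \partial X_{jm_2} \partial X_{jm_3}}\Big| \big(E[|Z_{jm_1} Z_{jm_2} Z_{jm_3}|] + E[|U_{jm_1} U_{jm_2} U_{jm_3}|]\big)$$

By Hölder inequality,

$$E[|Z_{jm_1} Z_{jm_2} Z_{jm_3}|] \le \max_m E[|Z_{jm}|^3] \qquad \text{(A.8)}$$

and

$$E[|U_{jm_1} U_{jm_2} U_{jm_3}|] \le \max_m E[|U_{jm}|^3]$$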

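Denote

$$\Lambda_1 = \sup_X \Big|\frac{\partial F_\alpha(X)}{\partial X_{jm_1}} \frac{\partial F_\alpha(X)}{\partial X_{jm_2}} \frac{\partial F_\alpha(X)}{\partial X_{jm_3}}\Big| \qquad \text{(A.9)}$$

$$\Lambda_2 = \sup_X \Big|\frac{\partial F_\alpha(X)}{\partial X_{jm_1}} \frac{\partial^2 F_\alpha(X)}{\partial X_{jm_2} \partial X_{jm_3}}\Big| + \sup_X \Big|\frac{\partial F_\alpha(X)}{\partial X_{jm_2}} \frac{\partial^2 F_\alpha(X)}{\partial X_{jm_1} \partial X_{jm_3}}\Big| + \sup_X \Big|\frac{\partial F_\alpha(X)}{\partial X_{jm_3}} \frac{\partial^2 F_\alpha(X)}{\partial X_{jm_1} \partial X_{jm_2}}\Big| \qquad \text{(A.10)}$$

and

$$\Lambda_3 = \sup_X \Big|\frac{\partial^3 F_\alpha(X)}{\partial X_{jm_1} \partial X_{jm_2} \partial X_{jm_3}}\Big| \qquad \text{(A.11)}$$

Then

$$\sup_X \Big|\frac{\partial^3 g(F_\alpha(X))}{\partial X_{jm_1} \partial X_{jm_2} \partial X_{jm_3}}\Big| \le \|g'''\|_\infty \Lambda_1 + \|g''\|_\infty \Lambda_2 + \|g'\|_\infty \Lambda_3 \qquad \text{(A.12)}$$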



So, it only remains to bound the partial derivatives of $F_\alpha$.

To simplify notation, denote $B_{km} = \exp(\alpha f_{km}(X))$ for $k = 1,...,K$ and
$m = 1,...,p$. Then

$$\frac{\partial F_\alpha(X)}{\partial X_{jm_1}} = \frac{\sum_k a_{kjm_1} B_{km_1}}{\sum_{k,m} B_{km}} \qquad \text{(A.13)}$$

The expression on the right hand side of the formula above is the expectation of a
random variable which takes value $a_{kjm_1}$ with probability $B_{km_1} / \sum_{k,m} B_{km}$ for
$k = 1,...,K$ and $0$ with probability $1 - \sum_k B_{km_1} / \sum_{k,m} B_{km}$. If $m_1$, $m_2$, and $m_3$ are all
different, then

$$\frac{\partial F_\alpha(X)}{\partial X_{jm_1}} \frac{\partial F_\alpha(X)}{\partial X_{jm_2}} \frac{\partial F_\alpha(X)}{\partial X_{jm_3}} \qquad \text{(A.14)}$$

will be the product of expectations of 3 random variables with nonintersecting supports.
It is easy to see that this product will be not greater than $a^3/27$. All other cases can
be treated by the same argument. We have

$$\Lambda_1 \le \begin{cases} a^3/27 & \text{if } m_1, m_2, \text{ and } m_3 \text{ are all different} \\ 4a^3/27 & \text{if } m_1 = m_2 \ne m_3 \\ a^3 & \text{if } m_1 = m_2 = m_3 \end{cases} \qquad \text{(A.15)}$$

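If $m_1$, $m_2$, and $m_3$ are all different, then

$$\frac{\partial^2 F_\alpha(X)}{\partial X_{jm_1} \partial X_{jm_2}} = -\alpha \frac{\sum_k a_{kjm_1} B_{km_1} \sum_k a_{kjm_2} B_{km_2}}{(\sum_{k,m} B_{km})^2} \qquad \text{(A.16)}$$

and

$$\frac{\partial^3 F_\alpha(X)}{\partial X_{jm_1} \partial X_{jm_2} \partial X_{jm_3}} = 2\alpha^2 \frac{\sum_k a_{kjm_1} B_{km_1} \sum_k a_{kjm_2} B_{km_2} \sum_k a_{kjm_3} B_{km_3}}{(\sum_{k,m} B_{km})^3} \qquad \text{(A.17)}$$

If $m_1 = m_2 \ne m_3$, then

$$\frac{\partial^2 F_\alpha(X)}{\partial X_{jm_1} \partial X_{jm_2}} = \alpha \frac{\sum_k a_{kjm_1}^2 B_{km_1}}{\sum_{k,m} B_{km}} - \alpha \frac{(\sum_k a_{kjm_1} B_{km_1})^2}{(\sum_{k,m} B_{km})^2} \qquad \text{(A.18)}$$

and

$$\frac{\partial^3 F_\alpha(X)}{\partial X_{jm_1} \partial X_{jm_2} \partial X_{jm_3}} = 2\alpha^2 \frac{(\sum_k a_{kjm_1} B_{km_1})^2 \sum_k a_{kjm_3} B_{km_3}}{(\sum_{k,m} B_{km})^3} - \alpha^2 \frac{\sum_k a_{kjm_1}^2 B_{km_1} \sum_k a_{kjm_3} B_{km_3}}{(\sum_{k,m} B_{km})^2}$$

If $m_1 = m_2 = m_3$, then

$$\frac{\partial^3 F_\alpha(X)}{\partial X_{jm_1} \partial X_{jm_2} \partial X_{jm_3}} = \alpha^2 \frac{\sum_k a_{kjm_1}^3 B_{km_1}}{\sum_{k,m} B_{km}} - 3\alpha^2 \frac{\sum_k a_{kjm_1}^2 B_{km_1} \sum_k a_{kjm_1} B_{km_1}}{(\sum_{k,m} B_{km})^2} + 2\alpha^2 \frac{(\sum_k a_{kjm_1} B_{km_1})^3}{(\sum_{k,m} B_{km})^3}$$

So,

$$\Lambda_2 \le \begin{cases} 3\alpha a^3/27 & \text{if } m_1, m_2, \text{ and } m_3 \text{ are all different} \\ 59\alpha a^3/108 & \text{if } m_1 = m_2 \ne m_3 \\ 3\alpha a^3 & \text{if } m_1 = m_2 = m_3 \end{cases} \qquad \text{(A.19)}$$

and

$$\Lambda_3 \le \begin{cases} 2\alpha^2 a^3/27 & \text{if } m_1, m_2, \text{ and } m_3 \text{ are all different} \\ 8\alpha^2 a^3/27 & \text{if } m_1 = m_2 \ne m_3 \\ \alpha^2 a^3 & \text{if } m_1 = m_2 = m_3 \end{cases} \qquad \text{(A.20)}$$

Therefore,

$$\Big|E[g(\max_{k,m} f_{km}(Z))] - E[g(\max_{k,m} f_{km}(U))]\Big| \le 2\|g'\|_\infty \alpha^{-1} \log(Kp) + C(g) \frac{n p^3 \alpha^2 a^3}{6} \Big(\max_{j,m} E[|Z_{j,m}|^3] + \max_{j,m} E[|U_{j,m}|^3]\Big)$$

Optimizing with respect to $\alpha$ yields the result. □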
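The final optimization step balances a term of order $1/\alpha$ against a term of order $\alpha^2$. As a hedged sketch, write the bound as $A/\alpha + B\alpha^2$, where $A$ and $B$ are shorthand introduced here (not the paper's notation) for the two coefficients; the minimizer is $\alpha^* = (A/(2B))^{1/3}$ and the minimum equals $3 \cdot 2^{-2/3} A^{2/3} B^{1/3}$, which is where a factor of the form $3/6^{1/3}$ can come from. The sketch ignores the constraint $\alpha \ge 1$ and uses arbitrary illustrative values.

```python
def bound(alpha, A, B):
    """Two-term bound A / alpha + B * alpha**2."""
    return A / alpha + B * alpha ** 2

def optimal(A, B):
    """Closed-form minimizer and minimum of A / alpha + B * alpha**2."""
    alpha_star = (A / (2.0 * B)) ** (1.0 / 3.0)
    value = 3.0 * 2.0 ** (-2.0 / 3.0) * A ** (2.0 / 3.0) * B ** (1.0 / 3.0)
    return alpha_star, value

A, B = 1.7, 0.3                      # arbitrary illustrative coefficients
alpha_star, value = optimal(A, B)
# Brute-force comparison on a grid of alpha values in (0, 20).
grid_min = min(bound(0.01 * i, A, B) for i in range(1, 2000))
```

The grid search never falls below the closed-form minimum, and evaluating `bound` at `alpha_star` reproduces `value`.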



A.4 Primitive Conditions for Assumption 1

In this section, I give a counter-example to the statement that, for assumption 1 to
hold, it suffices to assume that $\{X_i : i = 1,...,n\}$ are sampled from a distribution
that is absolutely continuous with respect to Lebesgue measure, has bounded support,
and whose density is bounded from above and away from zero on the support. I also
prove that assumption 1 holds if, in addition to the above conditions, one assumes that
the support is a convex set.

Lemma 3. There exists a probability distribution on $[-1, 1]^2$ which is uniform on
its support such that if $\{X_i : i = 1,...,n\}$ are sampled from this distribution, then
assumption 1 fails.

Proof. As an example of such a probability distribution, consider the uniform
distribution on

$$S = \{(x_1, x_2) \in [-1, 1]^2 : x_1 \ge 0;\ -(1 + \alpha)x_1^\alpha/2 \le x_2 \le (1 + \alpha)x_1^\alpha/2\} \qquad \text{(A.21)}$$

for some $\alpha > 0$. For fixed $i$, the probability that $X_{i1} \le h$ is $p_1 = h^{1+\alpha}$, and the
probability that $X_{i1} > \bar h$ is $p_2 = 1 - \bar h^{1+\alpha}$. Let $A_n$ be an event that $X_{i1} \le h$ for
exactly one $i = 1,...,n$ whereas $X_{i1} > \bar h$ for all other $i = 1,...,n$ with $h < \bar h$. The
probability of this event is

$$P(A_n) = n p_1 p_2^{n-1} = n h^{1+\alpha} (1 - \bar h^{1+\alpha})^{n-1} \qquad \text{(A.22)}$$

Set $h = (C_1/n)^{1/(1+\alpha)}$ and $\bar h = (C_2/n)^{1/(1+\alpha)}$ with $0 < C_1 < C_2 < 1$. Then we can
find the limit of $P(A_n)$ as $n \to \infty$:

$$\lim_{n \to \infty} P(A_n) = \lim_{n \to \infty} C_1 (1 - C_2/n)^{n-1} = C_1 e^{-C_2} > 0 \qquad \text{(A.23)}$$

Note that on $A_n$, there is an observation $X_i$ such that there are no other observations
in the ball with center at $X_i$ and radius $(C_2^{1/(1+\alpha)} - C_1^{1/(1+\alpha)})/n^{1/(1+\alpha)}$. The result now
follows by choosing $\alpha$ sufficiently large such that $n^{-1/(1+\alpha)}$ converges to zero slower
than $h_{min}$. □
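The limit in (A.23) can be illustrated with a few lines of arithmetic. This is a minimal sketch; the constants `c1` and `c2` are arbitrary values satisfying $0 < C_1 < C_2 < 1$, and `prob_An` is an illustrative name.

```python
import math

def prob_An(n, c1, c2):
    """n * p1 * p2**(n - 1) with p1 = c1 / n and p2 = 1 - c2 / n, as in (A.22)."""
    return n * (c1 / n) * (1.0 - c2 / n) ** (n - 1)

c1, c2 = 0.3, 0.6
limit = c1 * math.exp(-c2)           # claimed limit C_1 * exp(-C_2)
values = [prob_An(n, c1, c2) for n in (10, 1_000, 100_000)]
```

The entries of `values` approach `limit`, which is strictly positive, so the event $A_n$ keeps non-vanishing probability as the sample grows.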

Now I give a sufficient primitive condition for assumption 1.

Lemma 4. If $\{X_i : i = 1,...,n\}$ are sampled from a distribution which is absolutely
continuous with respect to Lebesgue measure, has bounded and convex support $S \subset \mathbb{R}^d$,
and whose density is bounded from above and away from zero on the support, then
assumption 1 holds for large $n$ almost surely.

Proof. Consider sets of the following form: $I(a_1,...,a_d,c) = S \cap \{x : a_1 x_1 + ... + a_d x_d = c\}$
with $a_1^2 + ... + a_d^2 = 1$. These are convex sets. It follows from the fact that the
density is bounded from above that $\inf_{a_1,...,a_d} \sup_c D(I(a_1,...,a_d,c)) > 0$ where $D(\cdot)$
denotes the diameter of the set. So, there exists some constant $0 < C < 1$ such that
for all $r < 1$ and all $x \in S$, each ball with center at $x$ and radius $r$ has at least fraction
$C$ of its Lebesgue measure inside of the support $S$: $\lambda(B(x,r) \cap S)/\lambda(B(x,r)) \ge C$.

Note that $\delta$-covering numbers of the set $S$ satisfy $N(\delta) \lesssim \delta^{-d}$ as $\delta \to 0$, i.e. there
exists some constant $C > 0$ such that $N(\delta, S) \le C/\delta^d$. Consider the lower bound.
For each $h \in H_n$, consider the set of covering balls with centers $G_{h,1},...,G_{h,N(h)}$ and
radii $\delta_h = h/2$. Then for each $X_i$ and $h \in H_n$, there exists some $j \in \{1,...,N(h)\}$
such that $B(X_i, h) \supset B(G_{h,j}, \delta_h)$. Thus, it is enough to prove the lower bound for the
number of observations dropping into these covering balls. Since the density is bounded
away from zero, there exists some constant $C > 0$ such that for each $h \in H_n$ and
$j = 1,...,N(h)$, $P(X_i \in B(G_{h,j}, \delta_h)) \ge C h^d$. Denote $I_{h,j}(X_i) = 1\{X_i \in B(G_{h,j}, \delta_h)\}$.
A Hoeffding inequality (see proposition 1.3.5 in Dudley (1999)) gives

$$P\Big\{\sum_{i=1}^n I_{h,j}(X_i)/n \le C h^d/2\Big\} \le P\Big\{\sum_{i=1}^n I_{h,j}(X_i)/n - E[I_{h,j}(X_i)] \le -C h^d/2\Big\} \le C \exp(-C n h^d) \qquad \text{(A.24)}$$

Then by union bound,

$$P\Big(\bigcup_{h \in H_n,\ j = 1,...,N(h)} \Big\{\sum_{i=1}^n I_{h,j}(X_i)/n \le C h^d/2\Big\}\Big) \le C h_{min}^{-d} \log n \exp(-C n h_{min}^d) \qquad \text{(A.25)}$$

as $n \to \infty$. Summing the probabilities above over $n$, we conclude, by the
Borel-Cantelli lemma, that the lower bound in assumption 1(iii) holds for large $n$ almost
surely. A similar argument gives the upper bound. □

A.5 Anticoncentration Inequality for the Maximum of Gaussian Random Variables

In this section, I derive an upper bound for the pdf of the maximum of correlated
Gaussian random variables satisfying certain assumptions. Let $\{Z_i : i = 1,...,S\}$
be a set of standard Gaussian random variables. Assume that this set contains at
least $M$ independent random variables. Define $W = \max_{i=1,...,S} Z_i$. Let $m$ denote the
median of $W$ and $f_W(\cdot)$ denote its pdf. Then

Lemma 5. $\sup_{w \ge m} f_W(w) \le C\sqrt{\log(M + 1)}\, S/M$ for some universal constant $C$.

Proof. The case $M = 1$ is trivial. So, assume that $M > 1$. Let $\Phi(\cdot)$ and $\phi(\cdot)$ denote
the cdf and the pdf of the standard Gaussian distribution. Since there are at least $M$
independent standard Gaussian random variables, $\Phi^M(m) \ge 1/2$ and $m > 0$. So,
there exists some constant $C > 0$ such that $\Phi(x) \le 1 - \phi(x)/(Cx)$ for any $x \ge m$ (see
proposition 2.2.1 in Dudley (1999)) and

$$\Big(1 - \frac{\phi(m)}{Cm}\Big)^M \ge \frac{1}{2} \qquad \text{(A.26)}$$

Let $y$ denote the unique positive real number such that

$$\Big(1 - \frac{\phi(y)}{Cy}\Big)^M = \frac{1}{2} \qquad \text{(A.27)}$$

Note that $y \le m$. In addition, $y$ is increasing in $M$, so there exists some constant
$C_1 > 0$ such that $y \ge C_1$ for any $M \ge 2$. Taking logs of both sides of equation (A.27)
and noting that $\log(1 + x) \le x$ for any $x \in \mathbb{R}$, we obtain $\phi(y) \le y \log C/M$ for some
constant $C > 1$. On the other hand, $\phi(y)/(Cy) \le 1/2$. So the inequality $\log(1 + x) \ge 2x$
for any $x \in (-1/2, 0]$ gives

$$\frac{2M\phi(y)}{Cy} \ge \log 2 \qquad \text{(A.28)}$$

Combining this inequality with $y \ge C_1$ yields $y \le C\sqrt{\log(M + 1)}$ for any $M$ if $C$ is
sufficiently large. Therefore, $\phi(y) \le C\sqrt{\log(M + 1)}/M$ and for any $w \ge m$,

$$f_W(w) \le S\phi(w) \le S\phi(m) \le S\phi(y) \le C\sqrt{\log(M + 1)}\, S/M \qquad \text{(A.29)}$$

□
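Lemma 5 can be illustrated in the special case where all $S = M$ variables are independent, since the pdf of the maximum is then available in closed form as $M\phi(w)\Phi^{M-1}(w)$. The sketch below uses only the Python standard library; the grid resolution, the search width, and the function name are arbitrary choices made for illustration, and the independent case is only one instance of the lemma.

```python
import math
from statistics import NormalDist

STD = NormalDist()  # standard Gaussian distribution

def sup_density_above_median(M, grid=2000, width=6.0):
    """sup over w >= m of the pdf of the max of M iid N(0,1) draws (S = M)."""
    m = STD.inv_cdf(0.5 ** (1.0 / M))        # median of the maximum
    best = 0.0
    for i in range(grid + 1):
        w = m + width * i / grid
        best = max(best, M * STD.pdf(w) * STD.cdf(w) ** (M - 1))
    return best

# Ratio of the supremum to sqrt(log(M + 1)); Lemma 5 says it stays bounded.
ratios = {M: sup_density_above_median(M) / math.sqrt(math.log(M + 1))
          for M in (2, 10, 100, 10_000)}
```

In this special case the ratios stay below one across several orders of magnitude of `M`, in line with the $\sqrt{\log(M+1)}$ growth rate in the lemma.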

A.6 Result on Gaussian Random Variables

In this section, I state a result on Gaussian random variables which will be used in
the derivation of the lower bound on the rate of uniform consistency.

Lemma 6. Let $\xi_n$, $n = 1,...,\infty$, be a sequence of independent standard Gaussian
random variables and $w_{i,n}$, $i = 1,...,n$, $n = 1,...,\infty$, be a triangular array of positive
numbers. If $w_{i,n} \le C\sqrt{\log n}$ with $C \in (0, 1)$ for all $i = 1,...,n$, $n = 1,...,\infty$, then

$$\lim_{n \to \infty} E\Big[\Big|n^{-1} \sum_{i=1}^n \exp(w_{i,n}\xi_i - w_{i,n}^2/2) - 1\Big|\Big] = 0 \qquad \text{(A.30)}$$

Proof. The proof is based on the generalization of lemma 6.2 in Dümbgen and Spokoiny
(2001). Denote $Z_{i,n} = \exp(w_{i,n}\xi_i - w_{i,n}^2/2)$ and $t_n = (E[(\sum_{i=1}^n Z_{i,n}/n - 1)^2])^{1/2}$. Note
that $EZ_{i,n} = 1$ and $EZ_{i,n}^2 = \exp(w_{i,n}^2)$. Thus,

$$t_n^2 = \sum_{i=1}^n (EZ_{i,n}^2 - (EZ_{i,n})^2)/n^2 \le \sum_{i=1}^n \exp(w_{i,n}^2)/n^2 \to 0 \qquad \text{(A.31)}$$

if $\max_{i=1,...,n} \exp(w_{i,n}^2)/n \to 0$. The last condition holds by assumption. So,

$$E\Big|n^{-1} \sum_{i=1}^n \exp(w_{i,n}\xi_i - w_{i,n}^2/2) - 1\Big| = \int_0^\infty P\Big\{\Big|n^{-1} \sum_{i=1}^n Z_{i,n} - 1\Big| > t\Big\} dt \le t_n + \int_{t_n}^\infty t_n^2 t^{-2} dt = 2t_n \to 0$$

□
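To see how slowly the bound $2t_n$ in the proof of Lemma 6 decays, one can evaluate it exactly in the equal-weights case $w_{i,n} = C\sqrt{\log n}$, where $t_n^2 = (n^{C^2} - 1)/n$. The following is a minimal sketch; the value of `c` and the function name are illustrative assumptions.

```python
import math

def t_n(n, c):
    """Exact t_n when every weight equals c * sqrt(log n), with 0 < c < 1."""
    w2 = (c ** 2) * math.log(n)                   # w_{i,n}^2
    return math.sqrt((math.exp(w2) - 1.0) / n)    # sqrt(Var(Z_{i,n}) / n)

c = 0.8
bounds = [2.0 * t_n(n, c) for n in (10**2, 10**4, 10**6, 10**8)]
```

The values in `bounds` decrease toward zero at the polynomial rate $n^{(C^2 - 1)/2}$, which is why the requirement $C < 1$ cannot be dropped in this argument.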

A.7 Preliminary Technical Results

In this section, I derive some necessary preliminary results that are used in the proofs 
of the theorems stated in the main part of the paper. It is assumed throughout that 
assumptions [U1H] hold. I will use the following additional notation. Let {V'nj^i be 
a sequence of positive real numbers such that ip n > C^p log n/n K ^ for some large 
constant C$ > and tp n — > as n — > oo. For any A G (0, 1), define cfi A '° G K and 
di-x' '■ IR — > [0, 1] by analogy with cfi A and gfj^ with Ej used instead of Sj for all 
z = 1, ...,n. Denote S% = {s G S n : / S /V s > -(cfif;° + /?„)}. For any A G (0, 1), 



define cf__ A G R and gf_ A : R ->■ [0, 1] by analogy with cf^ 5 and with used 

instead of S* MS . Let {e* : i = 1, ...,n} be an iid sequence of p-dimensional standard 
Gaussian random vectors that are independent of the data. Denote ej = S 1//2 e,- and 
ej = Y}l 2 ej. Note that (ij is equal in distribution to Yj. Finally, denote 

n 

£i,m,h — 2_^ w h{Xi, Xj)ej >m (A. 32) 

J=l 

n 

fi,m,h = Yl W ^ X » x i)f™(Xj) (A.33) 

3=1 






3=1 
n 

ei,m,h = 22w h (Xi,Xj)ej 

3=1 



rpPIA _ 
rpPIAfl 



maxfes/Vs) 

= max(e 8 /V 8 ) 
seS n 



(A.34) 

(A.35) 

(A.36) 
(A.37) 



Note that T is equal in distribution to the simulated statistic. 



I start with a result on bounds for weights and variances of the kernel estimator.
The same result can be found in Horowitz and Spokoiny (2001).



Lemma 7. There exist constants $C > 0$ and $0 < C_1 < C_2 < \infty$ such that, for any
$i, j = 1,...,n$, $m = 1,...,p$, and $h \in H_n$,

$$w_h(X_i, X_j) \le C/(nh^d)$$

and

$$C_1/\sqrt{nh^d} \le V_{i,m,h} \le C_2/\sqrt{nh^d}$$
Proof. By assumptions Q] and |6j for any i = 1, ...,n and h G H n , 

n 

C x nh d < CM h/2 (Xi) < J2 K ( X * - X *) ^ M ^ X i) ^ C * nhd 



(A.38) 



(A.39) 



k=l 



and 



dnh d < R2 ( X i - X k) < C 2 nh 6 



(A.40) 



(A.41) 



k=l 



for some constants C > and < C\ < C 2 < 00. In addition, K{Xi — Xj) < 1 for 
any j = 1, ...,n. So, 



{X t - Xj) = K[X t - Xj)/ K ( X i - Xk) < C/(nh° 



(A.42) 



fc=i 
By assumption 2, since $\sum_{j=1}^n w_h(X_i, X_j) = 1$,



n 

4 1/2 

\i=i 



< 



i ii 

C max w fc (Xi,Xj 

7 = l,...,n 



and 



1/2 / n X 1/2 



o=i / \j=l 



(A.43) 
□ 



Lemma 8. $E[\max_{s \in S_n} |e_s/V_s|] \le C(\log n)^{1/2}$.

Proof. For any $s \in S_n$, $e_s/V_s$ is a standard Gaussian random variable. Denote
$\psi(x) = \exp(x^2) - 1$. Let $\|\cdot\|_\psi$ denote the $\psi$-Orlicz norm. It is easy to check that
$\|e_s/V_s\|_\psi \le C < \infty$. So, by lemma 2.2.2 in Van der Vaart and Wellner (1996),

$$E[\max_{s \in S_n} |e_s/V_s|] \le C\|\max_{s \in S_n} |e_s/V_s|\|_\psi \le C(\log n)^{1/2} \qquad \text{(A.44)}$$

since $|S_n| \le Cn^\phi$ for some $\phi > 0$. □
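The logarithmic growth in Lemma 8 can be illustrated by simulation. The sketch below assumes NumPy and uses independent standard Gaussians for simplicity, whereas the lemma allows arbitrary dependence; the comparison value $\sqrt{2\log(2N)}$ is the standard maximal-inequality bound for $N$ standard Gaussians and is not a constant taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(3)
N, reps = 1000, 2000                       # N plays the role of |S_n|
draws = np.abs(rng.standard_normal((reps, N)))
mean_max = draws.max(axis=1).mean()        # Monte Carlo estimate of E[max_i |Z_i|]
maximal_bound = np.sqrt(2.0 * np.log(2 * N))
```

The estimate `mean_max` falls below `maximal_bound`. Because $|S_n|$ grows polynomially in $n$, a bound of order $\sqrt{\log |S_n|}$ translates into the $C(\log n)^{1/2}$ rate stated in the lemma.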
Lemma 9. max seSn \V S /V S - 1| = o p (n~ K ) and max seSn \V S /V S - 1| = Op(n _K ). 
Proof. By assumption [2], for any (i,m,h) G S n , 

n n 



In addition, 



\V'm,h ~ V Lh\ < J2 W ^ X ^rnm ~ ^ h mm\ (A.46) 

i=i 
So,



max \V^/Vg — 1| < C max max |E,- mm — E, mm | 

se5 n m=l,...,p j=l,...,n 

< C max 1 1 S,- — S,- 1 1 o 

3=l,...,n 

Assumption 5 gives $\max_{j=1,...,n} \|\hat\Sigma_j - \Sigma_j\| = o_p(n^{-\kappa})$. So, $\max_{s \in S_n} |\hat V_s^2/V_s^2 - 1| = o_p(n^{-\kappa})$.
Combining this result with the inequality $|x - 1| \le |x^2 - 1|$, which holds for any
$x > 0$, yields the first result of the lemma. The second result follows from the first
one and the inequality $|1/x - 1| \le 2|x - 1|$, which holds for any $|x - 1| \le 1/2$. □

Lemma 10. P{cff£°_*, > cfi£} = o(l) and P{cf_^% n < c™} = o(l) /or any 
sequences {^ n }^i awd {^n}S£Li of positive numbers satisfying v n + ip n < 1/2 and 
ipn > C^p\ogn/n K l 4 with large enough > 0. 



Proof. Denote 



and 



Pi = max 



max 



5-1 



P2 = max 



£J =1 ti; h (X, l X i )((E}'*-E / )e ]lv , 



1 l/2^ 



V; 



i,m,h 



Then 



IT 



< Pi + P2 



(A.47) 



(A.48) 



(A.49) 



Let A denote the event {max J=li ... in ||Sj — £j|| < n K }. By assumption [51 P(A) — > 1 
as n — > oo. Thus, it is enough to show that cf 7 ^'° < cf ^ and cf f^'°_ > cf ^ on 
A. 

As in the proof of lemma [9j max se s„ \V S /V S — 1| < Cn~ K on A. By lemma El 



_E[max seSn e s /Vg] < CyTogn. So, Markov inequality gives for any B > 0, on A, 



P(pi > Cx/Iognn-^SIF") < 1/B 



(A.50) 
where Y™ is a shorthand for {li}" =1 . Consider p 2 . For any j = 1, n and m — 1, p, 

< E[nsf-sf n2iic,-irii?] 

< put 1 /* 

where the last line follows from lemmaEJ So, conditionally on Y™, on A, Y^j=i w h(Xi, Xj)((Jl l J 2 — 

1 /2 

)ej)m/^,m,h is mean-zero Gaussian random variable with variance bounded by 
p 2 n~ K for any (i,m, h) G S n . In addition, on A, max se s n V^/V^ < 2 for large n. Thus, 
Markov inequality and the argument like that used in lemma [8] yield 

P(p 2 > C^\ognpn- K/2 B\Y?) < 1/B (A.51) 

on A. Take B = ?W 4 / (plogn). Recall that ip n > C^p\ogn/n K ^. So, tp n > 
max(4/5, C\p 2 {\ogn) 2 n~ K l 2 B) for some large C\ > whenever C*i < C^. 

Note that T p/A '° is the maximum over \S n \ standard Gaussian random variables. 
In addition, for fixed m = l,...,p and h G H n , random variables {ei, m> h/Vi,m,h '■ 
(i,m,h) G S n } are mutually independent, \H n \ < Clogn. So, lemma |5] gives 

c i-tf-ip n /2 ~ c i-vn°-ip n — C^n/(p(l°g n ) 3//2 )- I wm assume that C in the last inequality 
is smaller than C\. 

Now the first part of the lemma follows from 

E[9Zt%S TPIA )\ Y ?] < E[9^t%ST PI ^- Pl -p 2 )\Yn 

< E[g P dt% n {T PIA » - Cy/^n"l*B)\Y?\ + 2/B 

< E[gZt% n/ 2iT PIAfi )\Yn+VB 
= l-v n -^ n /2 + 2/B 

< l-v n 

on A. The second part of the lemma follows from a similar argument. □ 

Lemma 11. E[g P ^f(max seSn (e s /V s ))] = l-u n +o(l) and E[g P ^f(- max seSn ( E s/V s ))] = 
1 — u n + o(l) /or any sequence {f n }^ =1 such that v n G (0, 1). 
Proof. By lemma for any (i,m, h) £ S n and any j = 1, ...,n, 



^(Xi^OAW < ^/v 7 ^ < C/\frt4~ (A.52) 

Recall the definition of C(-) given before theorem [61 By assumption /3 = /3 n < C 
for some constant C > 0. So, C(g?I£°) < C//3 3 . In addition, || (^fjf«*°)'||oo < 
Given assumption^ the result follows by applying theorem [6] with g = gil a '°, Zj = Sj, 
Yj = Sj /2 e i5 a = C/y/nh^ and K < Cn^ for some > 0. □ 



Lemma 12. max seS „ \e s /V s \ = O p (y/\ogn) and max seSji \e s /V s \ = O p (y/fogn). 

Proof. Combining the definition of go, lemma HH and (3 n < C for some constant 
C > gives 

P{max(e s /V s ) > C^/h^} < 1 - E[g ((max(e s /V s ) + f3 n - C^n)/f3 n )} 

= 1 - E[g {{max{e s /V s ) + (3 n - Cy/to£ri)/p n )] + o(l) 

s£S„ 

< P{max(e s /K) > C^/fog~n~ - f3 n } + o(l) 



< P{max(e s /K) > (C/2) + o(l) 



By lemma EJ max se s n (e s /Vs) = O p {y/\ogn). So, by choosing n large enough and 
then C large enough, we can make P{max s& s , n (e a /'V^) > Ci/log n} arbitrarily small 
uniformly in n. The same reasoning gives the lower as well. We conclude that 



max sg 5 n |e s /K| = O p (y/\ogn). The second result follows from 

max |£ s /Vs| < max |e s /V^| max(V s /V s ) = O p (^/\og n) (A. 53) 

since max se s n (V s /V s ) = O p (l) by lemma [91 □ 
Lemma 13. P{max aeSn \ S g f s /V s > 0} < j n + o(l). 
Proof. By lemma [TTJ 



P{max( £s /K) < cZ^% n + fin} > E[g^% n {m^(e s /V s ))\ = l- ln -^ n + (l) 

(A.54) 
Since for any s E S n \S°, f s /V s < -(cfi£^ + P n ), 

P{ max (f a /V a ) > 0} = P{ max D {f s /V s ) > 0} 

= P{ max {f s /V s + e s /V s ) > 0} 

s&S n \S 

n 

^ P { ^J-^\, + Pn) + Zs/Vs) > 0} 

8tz£>n\O n 

< P W(e./V.) > cfi£% n + Pn] 

< l-(l- Tn -Vv>+o(l) 

= 7n + V'n + 0(1) 

Noting that = o(l) yields the result. □ 
Lemma 14. P{Sf C S* MS } > 1 - 7„ + o(l). 

Proof. By lemma [TUl P{cf_f^°_^ n > cf_^ n } = o(l). In addition, for any x G (—1, 1), 
2/(1 + x) - 1 > 2(1 - x) - 1 > 1 - 2x > 1 - 2\x\ (A.55) 

So, 

P{SZcS™ s } = P{mm(fs/V s )>-2(cfli+P n )} 

> P{mm(f s /V s ) max(K/K) > -2(cf_ 7 ^ + /?„)} 

> P{min(-(cf^% n + /3 n ) + £ S /K) max(Vyv;) > -2(cf_^ + /?„)} 
= P{min( £s /K) > cZt% n +P n - 2(cfi4 + ft)/m^/F s )} 

> P{max(-e s /y s ) < -cf_ 7 i% n - /3„ + 2(cf^° + &)/ max(K/y s )} + o(l) 

> P{max(-e s /K) < + /?„)(1 - 2| max(K/K) - 1|} + o(l) 

Combining lemma [H] and Markov inequality yields 

7^ + Vn = 1 - E[g^f_^ n (max(e s /V s ))} 

< P{max(e s /K) > cf_^%J 

< C(logn)VV c f_^ 
So, cZZ% n < C(logn) 1 /2/( 7n + ^). By lemma El \max seS n(V s /V s ) - 1| < Cn- 
wpal. So, wpal, 



( C f- J t%« + &X 1 - 2 I m ?x(K/V;) - 1|) > cZ*% n + Pn - C(\ogn)^n-«/( ln + ^) 

ses n 

(A.56) 

Take Xn = Cp(\ogn) 2 n~' i / ( 7n + ip n ). Then Xn = o(l) by the choice of By lemma 

El 

C f-£V + A. - C(logn) 1 /^- 7(7n + ^ > cZt\ n - Xn + fin (A.57) 
Therefore, 

P{S°CS* MS } > P{m % x(-e s /V s )<c^ n _ Xn + (3 n } + o(l) 

> 1 - In ~ ?Pn ~ Xn + o(l) 

= l- 7n + o(l) 

since ^ n + Xn = o(l) . □ 
Lemma 15. // / = P , tfiera P{S* MS = S n } > 1 - 7n + o(l). 

Proo/. By lemma QUI P{cfj^V > c f-i> = (!)- % lemma El max seSn (K/V;) < 
1 + n~ K wpal as n — > oo. If / = P , then for any s G S n , f s = e s . So, 



P{S* MS = S n } = P{mm(s s /V s )>-2(c^i+M} 

PIAfi 



> P{mm(e s /V s ) > -2(cf^% n + /?„)} + o(l) 

> P{min(e s /K) max(K/K) > -2(cf_^% n +/?„)} + o(l) 

> P{rnin( £s /K)(l + n"«) > -2(cf_^% n + &)} + o(l) 

P7A,0 



> PWe./y.) > -2(cf^% n + &)(1 - n-)} + o(l) 

> P{min( £s /K) > ~{cZt% n + AO} + 

PJAO 



= P{max(-e s /K) < + /?„)} + o(l) 

> ^^f-i'V(w(-^/^))] +°(i) 
Combining these results with lemma 11 yields



P{S* MS = S n }>l- ln - i, n + o(l) (A.58) 

The result follows by noting that ip n = o(l). □ 
Lemma 16. cfif + (3 n < cfi£ + (3 n = P (V^)- 

Proof. Since S* MS C S n , cf*f + < cf™ + /3„. By lemmadOl P{cfi$ < cf™} = 
o(l). By assumption /3 n < C for some C > 0. Markov inequality and lemma Egive 
c i-a)°2 — Cy/logn for C large enough. Combining these results yields the statement 
of the lemma. □ 

Lemma 17. Let r > I, L > 0, x — (x 1 ,...,x d ) G M. d , h = (h^ h d ) G W 1 , and 
f G ^(t, L) for some q = 1, [r]. If q < [t], assume that for any x E M. d and all 
d-tuples of nonnegative integers a = (ai, a^) satisfying \a\ = q + 1, \D a f(x)\ < C 
for some constant C > 0. Then df(xi,...,Xd)/dx m > for all m = l,...,d implies 
that for any y = (yi, yd) G M d satisfying < y < h, 

/(x + y - / x > - ? ; ft < A.59 

(r - q + l)...(r - <r + g 

/or C = min(^ + 1, r). 

Proof. For any y = (yi,...,yd) G M d satisfying < y < h, choose a direction i = 



(/!,..., i d ) G M d by setting i m = y m j yEj=i»? for a11 m = Let 
denotes /c-th derivative of / in direction i evaluated at point x. Then f^-' l '{x) > 0. If 
+ ty) > for all t E (0, 1), then the result is obvious. If f^' l \x + t y) = for 
some to G (0, 1), then f^ k ' l \x + toy) = for all k — 1, <r. If ? = [r], then by Holder 
smoothness, f^ ,l \x + ty) > —(L(t — ^0) )^ Integrating it q times gives 

f{x + v) - Kx) >- — — ^ M' (A.») 

since £ = r in this case. If < [r], then f^' l \x + ty) > —C(t — t )\\y\\. Integrating 
it q times gives the inequality similar to flA.60j) with q + 1 instead of ( and C instead 
of L T_t . The result follows by noting that \\y\\ < \\h\\. □ 
A.8 Proofs of Theorems



Proof of Theorem 1. Under the null hypothesis, for any s G S n , f s < since the 
kernel K is positive by assumption [6j By lemma [T0| P{cf_f a fL > cfi^} = o(l). By 



E[g^(f)} = E[g^t^Afs/Vs))} 

sG6„ 

> E[g^(m^(e s /V s ))] 

> ^k P ^(max(e s /V;))] + (l) 

> ^k P/ ^„(max(e s /K) ma^/V.))] + o(l) 

> E^t^MMejV^l + rT*))] + o(l) 

> E[g ((msx(e s /V s )(l + n~«) - cfi^J/^)] + o(l) 



Denote <5 n = (\ogn/n K ) 1 ^ 2 . Two different cases will be considered depending on 
whether (3 n > 5 n or (3 n < 5 n . Divide the sequence {n}^ =l into two subsequences, 
{nl}^ =1 and {n^}^ =1 , so that (3 n i > d~ n i and (3 n 2 < <5 n 2 for all fceN. First, consider the 
subsequence {nj,}^ =1 . For simplicity of notation, I will drop indices writing n instead 
of n\. By lemma [121 max seSn \e s /V s \ = O p (^f\ogn). So, max s6Sn \e s /V s \/ (n K f3 n ) < 
n~ K / 4 wpal as n — > oo. Since go has bounded first derivative, 



£M(max(e s /K)(l + rT") - C^)/)^)] 

= Eh((ma X ( £s /K) - cf_^J//3 n )] + o(l) 

StJn 



The last expression equals S^^l . (max se s n (e s /V^))] + o(l). Combining these re- 
sults and lemma [11] yields 



Next, consider the subsequence {nl}^L v Again, I will write n instead of n\. Take 
Xn = Cp(\ogn) 2 n~ K / 2 with large enough C. Note that Xn = o(l). As in lemma HU 



lemma [U max s€Sn {V s /V s ) < 1 + n K wpal as n — > oo. So, 



E[g[^(f)} >l- a -i, n + o(l) = 1 - a + o(l) 



(A.61) 




(1 - - f3 n > c 



PIA,0 

l—a—lf'n—X 



(A.62) 
Continuing the chain of inequalities from above gives



E[g P dt(T)] > P{max(5 s /K)(l + ^)<cf^lV} + o(l) 

> P{max( £s /y s ) < cf_ 7 i_V(l - n-)} + o(l) 

> P{max( £s /y s ) - /3 n < cf_ z ^ n _ x J + o(l) 



An application of lemma [TT] yields 

E[g^(f)] >l-a-ip n - X n + o(l) = l-a + o(l) (A.63) 

Now consider the RMS test function. By lemma HU -P{cf_ Q+27n > cf_^ 27n } < 
7 n + o(l). By lemma H31 P{max seSn \ S) D / S /V s > 0} < 7 n + o(l). So, 



> S[^ D _ a+27ii (max(/ s /y s ))] - 7n + (1) 

> ^[^- Q+27 „(max(/ s /K))] - 2 7n + o(l) 



Since S„ is nonstochastic, from this point, the argument similar to that used in the 
proof for the plug-in test function with instead of S n yields the result for the RMS 
critical values. 

Next assume that / = P . By lemma ITU] P{c^i a ^ n < cfi^} = o(l). By lemma 
H2 mm seSn (y s /V s ) > 1 — n~ K wpal as n — > oo. So, 

^[^(T)] = E[g[^(mM/V s ))\ 



< E[g^ n (m^(ejV s ))}+o(l] 

PIA,0 



< E[g^ n (m^(s s /V s ) wm(V./V.))] + o(l) 

< £[^J^(m«(e./Vy (1 _ „-«))] + (i) 

= £?h((max( £s /K)(l - n^) - c^J/Pn)] + o(l) 
For the subsequence 0"4}i£U> writing n instead of n\ 



ki 



E[g ((max(e s /V s )(l - rT") - cf^J)//?, 



£[ 0o ((max( £s /K) _ cf^;V)//3n)] + o(l) 



So, the result that £?[^^f^(T)] < 1 — a + o(l) follows by applying [TTJ For the 
subsequence {nl}^, with the same choice of Xm 

(cf-^V + /?n)(l + 2n-) < cf_^V + xn ( A - 64 ) 
where I again write n instead of nj*. In addition, for any x e (0, 1/2), 

1/(1 - x) < 1 + 2x (A.65) 

So, 

P[<^CO] < P We./Vi)(l - n" K ) < cf_^;V + /?„} + o(l) 

< P{max( £s /K) < (cf_^;V + &X1 + 2rr«)} + o(l) 

< E[ 9 Zt^ n+x S^{e s /V s ))\ 

Again, the result that E[g[^(T)] < 1 — a + o(l) follows by applying lemma [HJ 

For the RMS test function, note that by lemma [T5l P{S* MS = S n } > l-7 n + o(l) 
whenever / = p . If 7 n = o(l), then 

E[g? M a S (T)\ = E[g(^ +2ln (f)\ + o(l) < 1 - a + o(l) (A.66) 



□ 


Proof of Corollary 1. If (logn) 19 / (h^ in n) — > 0, then one can set g n so that o n (log n) 3 ^ 2 — >■ 
and (logn) 4 /(o^°/imin ri ) ~~ * 0- Then o n satisfies assumption [3 So, the result of the- 
orem [T] holds for o n instead of f3 n . Let cf /j4,0 ' e denote the value of c^ IA '° evaluated 
with g n instead of n for all x G (0, 1). By lemma HOI ^{cfi^ n > cfi^} = o(l). By 
lemma [5l cf /A ' 0,e „ + p n < cf J f'°„ ; , for C large enough. So, 

P{f<c^} > E[g ((f + Qn -c^)/ Qn )] 

> E[g ((f + Q n -cZi^ n )/Qn)) + o(l) 

> E[g ((f - cf_^_^ (logn)3/2 )/^)] + o(l) 

From this point, the argument like that used in the proof of theorem [1] with g n instead 
of j3 n leads to P{T < cf 7 ^} > 1 — a + o(l). All other statements of theorem [1] follow 
from similar arguments. □ 


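Proof of Theorem 2. For any $w \in \mathcal{G}_\rho$, there exist $i(w) \in \mathbb{N}$ and $m(w) = 1,\dots,p$ such
that $f_{m(w)}(X_{i(w)}) > \rho$. For simplicity of notation, I will drop index $w$. By assumption
[3], there exists a ball $B_\delta(X_i)$ with center at $X_i$ and radius $\delta$ such that $f_m(X_j) > \rho/2$ for
all $X_j \in B_\delta(X_i)$. Note that $\delta$ can be chosen independently of $w$. So, for some $N \in \mathbb{N}$
and any $n > N$, there exists a triple $s_n = (i_n, m, h_n) \in S_n$ with $h_n$ bounded away
from zero such that $f_m(X_j) > \rho/2$ for all $X_j \in B_{h_n}(X_{i_n})$. Hence, $f_{s_n} > \rho/2$. Lemma [7]
gives $V_{s_n} < n^{-\phi}$ for some $\phi > 0$, so $f_{s_n}/V_{s_n} > C n^{\phi}$. By lemma [9], $|\hat V_{s_n}/V_{s_n} - 1| = o_p(1)$.
So, for any $\tilde C < C$, $P\{f_{s_n}/\hat V_{s_n} > \tilde C n^{\phi}\} \to 1$. Thus,

$$\begin{aligned}
E[g^{P}_{1-\alpha}(\hat T)] &\le P\{\hat T \le c^{P}_{1-\alpha} + \beta_n\} \\
&\le P\{f_{s_n}/\hat V_{s_n} \le c^{P}_{1-\alpha} + \beta_n + \max_{s \in S_n} |\varepsilon_s/\hat V_s|\} \\
&\le P\{c^{P}_{1-\alpha} + \beta_n + \max_{s \in S_n} |\varepsilon_s/\hat V_s| > \tilde C n^{\phi}\} + o(1)
\end{aligned}$$

The result follows by noting that from lemmas [12] and [16], $c^{P}_{1-\alpha} + \beta_n + \max_{s \in S_n} |\varepsilon_s/\hat V_s| = O_p(\sqrt{\log n})$. □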
Proof of Theorem 3: 

Proof of Theorem 3. As in the proof of theorem 2, since p(w, H ) > 0, there exists 
i G N such that /™(Xj) > p for some m = l,...,p and p > 0. In addition, by 
assumption [3l there exists a ball Bg(Xi) such that f™(Xj) > p/2 for all Xj G Bg(Xi). 
So, for some iV G N and any n > N, there exists a triple s n = (i n , m, h) G S n such
that fZ(Xj) > p/2 for all Xj G B h (X in ). Hence, / s n n > a n p/2. Note that in contrast 
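with theorem 2, now we choose fixed bandwidth value $h$. By lemma [7], $V_{s_n} < C/\sqrt{n}$.
Then lemma [9] gives $P\{f^{n}_{s_n}/\hat V_{s_n} > C a_n \sqrt{n}\} \to 1$ for some $C > 0$. The same argument
as in the proof of theorem 2 yields

$$E[g^{P}_{1-\alpha}(\hat T)] \le P\{c^{P}_{1-\alpha} + \beta_n + \max_{s \in S_n} |\varepsilon_s/\hat V_s| > C a_n \sqrt{n}\} + o(1) \qquad \text{(A.67)}$$

Combining $c^{P}_{1-\alpha} + \beta_n + \max_{s \in S_n} |\varepsilon_s/\hat V_s| = O_p(\sqrt{\log n})$ and $a_n \sqrt{n/\log n} \to \infty$ gives
the result. □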

Proof of Theorem 4: 

Proof of Theorem 4. First, consider the case $\tau \le 1$. In this case, $\zeta = \tau$. Since $d \ge 1$, we
are in the situation ( < d. For any u> G C7$, there exist G and m(w) = 1, ...,p 
such that > (C/2)^Lin- By assumptions [1] and [HI there exists = 

l,...,n such that \\X i{w) - X jiw) \\ < 3h min and s n (w) = (j(w),rn(w),h min ) G 5* n . By 
assumption El fZ( w )i x i) > C^Lin for a11 1 = l,-,n such that X t G ^^(X,-^)) for 
some constant C. So, f™t w \ > Ch !^ - m . By assumption [7J nh^J logn — >■ oo as n — )■ oo. 
By lemmaEl K„H < C/^K~. So, 

/: W /(^H\^) > {C/C)^nh 2 ^ d /\ogn > {C/C)^Jnh%J\ogn -+ oo 

(A.68) 

uniformly in w £ Q#. The result follows from the same argument as in the proof of 
theorem 2. 

Consider r > 1 case. Suppose ( < d. For any w G there exist i(w) G 
and m(w) = l,...,p such that fZ( w )( X i(w)) > (C/2)h c min . For m = l,...,d, set 
if df™( w j(Xi( w ^)/dx m > and — 4/i min otherwise. Consider the cube 
C whose edges are parallel to axes and that contains vertices (Xj^i, Xj^d) and 
(X i ( t0 ) ) i + 2e 1 , ...,X i ( t0 ) )d + 2e d ). By lemmaHZl for all x G C, fZ( w )( x ) > C^Lin for some 
constant C By the definition of N$ and assumption [IJ there exists l(w) = l,...,n 
such that Xi(u) £ -£>Vm(^j(w),i + ei, Xj^)^ + e^). By assumption El there exists 
j(w) = l,...,n such that X j(w) G SaiwG^iM.i + e 1 , X i{w)4 + e d ) and s n (iu) = 
(j(w),m(w) ) G oV So, /™ fal (X) > Ch^ for all Z = l,...,n such that X G 
$B_{h_{\min}}(X_{j(w)})$. The rest of the proof follows from the same argument as in the case
$\tau \le 1$.

Suppose $\zeta > d$. The only difference between this case and the previous one
is that now the optimal testing bandwidth value is greater than $h_{\min}$. Let $h_0$ be the
largest bandwidth value in the set $S_n$ which is smaller than $(\log n/n)^{1/(2\zeta+d)}$. For any
w G Gd, the same construction as above gives s n (w) = (j (w) , m(w) , h Q ) G S n such 
that fZ{ w )( X i) - P${w,H ) -Ch^ for all I = 1, ...,n such that X x G B ho {X^ w) ). Since 
p$(w,H ) > b n (logn/nY^ 2< ^ +d ^ for some sequence of real numbers {& n }£Li such that 
b n -> oo as n oo, > (6 n -C7)(logn/n)^( 2 ^). By lemmad V, n(lfl) < C/vW 

Then 

fZ( w) /(V Sn(w) Vh^) > {b n - C)/(2C) oo (A.69) 
The result follows as above. □ 
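
To make the bandwidth choice in the case $\zeta > d$ concrete, the following minimal sketch picks the largest bandwidth on a grid that lies below the threshold $(\log n/n)^{1/(2\zeta+d)}$. The geometric grid, the function names `geometric_bandwidths` and `pick_h0`, and the numerical values are illustrative assumptions, not taken from the paper.

```python
import math


def geometric_bandwidths(h_max, h_min, a=0.5):
    """Hypothetical bandwidth grid h_max * a**k, k = 0, 1, ..., kept while >= h_min."""
    if not (0.0 < a < 1.0) or not (0.0 < h_min <= h_max):
        raise ValueError("need 0 < a < 1 and 0 < h_min <= h_max")
    bandwidths = []
    h = h_max
    while h >= h_min:
        bandwidths.append(h)
        h *= a
    return bandwidths


def pick_h0(bandwidths, n, zeta, d):
    """Largest grid bandwidth strictly below (log n / n)**(1 / (2*zeta + d))."""
    threshold = (math.log(n) / n) ** (1.0 / (2.0 * zeta + d))
    below = [h for h in bandwidths if h < threshold]
    if not below:
        raise ValueError("no bandwidth on the grid lies below the threshold")
    return max(below)


if __name__ == "__main__":
    grid = geometric_bandwidths(h_max=1.0, h_min=0.01)
    # Illustrative values only: n = 1000 observations, zeta = 1, d = 1.
    print(pick_h0(grid, n=1000, zeta=1.0, d=1))
```

The threshold shrinks as $n$ grows, so on a fixed grid the selected bandwidth moves toward $h_{\min}$ with the sample size.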
Proof of Theorem 5: 

Proof of Theorem 5. Define $v : \mathbb{R} \times \mathbb{R}_+ \to \mathbb{R}_+$ as follows. Set $v(x, h) = 0$ if $x < 0$ or
$x > 2$ for all $h \in \mathbb{R}_+$.

First, define functions $b_1, \dots, b_K$ on $(0, 1]$ for some $K$ to be chosen below by the
following induction. Set $b_1(x) = +1$ for $x \in (0, 1/2]$ and $-1$ for $x \in (1/2, 1]$. Given
$b_1, \dots, b_{k-1}$, for $i = 1, 3, \dots, 2^k - 1$ and $x \in ((i-1)2^{-k}, i2^{-k}]$, set $b_k(x) = +1$ if $b_{k-1}(y) = +1$
for $y \in ((i-1)2^{-k}, (i+1)2^{-k}]$ and $-1$ otherwise. For $i = 2, 4, \dots, 2^k$ and
$x \in ((i-1)2^{-k}, i2^{-k}]$, set $b_k(x) = -1$ if $b_{k-1}(y) = +1$ for $y \in ((i-2)2^{-k}, i2^{-k}]$ and
$+1$ otherwise. By induction, define $b_1, \dots, b_K$ where $K$ is the largest integer strictly
smaller than $\tau$, i.e. $K = [\tau]$.
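
The inductive rule for the sign functions can be sketched in a few lines of code. The sketch rests on one reading of the construction: on odd-indexed dyadic cells $b_k$ keeps the sign that $b_{k-1}$ takes on the parent cell, and on even-indexed cells it flips that sign. The function name `b` and the test points are illustrative choices.

```python
import math


def b(k, x):
    """Sign function b_k on (0, 1] under the reading described above.

    At each level the cell ((i-1) 2^-level, i 2^-level] containing x is
    located; an even index i flips the sign inherited from the parent cell.
    """
    if k < 1:
        raise ValueError("k must be a positive integer")
    if not (0.0 < x <= 1.0):
        raise ValueError("x must lie in (0, 1]")
    sign = 1
    for level in range(1, k + 1):
        # Cells are half-open on the left, so ceil gives the cell index.
        i = math.ceil(x * 2**level)
        if i % 2 == 0:
            sign = -sign
    return sign


if __name__ == "__main__":
    # Signs of b_2 on the four cells of width 1/4, evaluated at midpoints.
    print([b(2, x) for x in (0.125, 0.375, 0.625, 0.875)])
```

Under this reading each $b_k$ takes the values $+1$ and $-1$ on sets of equal length within every parent cell, which is what lets the successive antiderivatives return to zero at the dyadic endpoints.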

Now let us define $v : \mathbb{R} \times \mathbb{R}_+ \to \mathbb{R}_+$. Set $v(x, h) = 0$ if $x < 0$ or $x > 2$ for all
$h \in \mathbb{R}_+$. For $x \in [0, 2]$, $v$ will be defined through its derivatives. Set $\partial^k v(0, h)/\partial x^k = 0$
for all $k = 0, \dots, K$. For $i = 1, \dots, 2^K$, once function $\partial^K v(x, h)/\partial x^K$ is defined for
$x \in [0, (i-1)2^{-K}]$, set

<9^(x, TO/ftp* = 0*"v((z - 1)2-*, 70/&c* + b K {x)h K L{x - (i - 1)2- K ) T ^ (A.70) 

for $x \in ((i-1)2^{-K}, i2^{-K}]$. These conditions define function $v(x, h)$ for $x \in [0, 1]$ and
$h \in \mathbb{R}_+$. For $x \in (1, 2]$ and $h \in \mathbb{R}_+$, set $v(x, h) = v(2 - x, h)$ so that $v$ is symmetric
in $x$ around $x = 1$. It is easy to see that for fixed $h \in \mathbb{R}_+$, $v(\cdot/h, h) \in \mathcal{F}(\tau, L)$ and
$\sup_{x \in \mathbb{R}} v(x/h, h) \in (C_1 h^{\tau}, C_2 h^{\tau})$ for some positive constants $C_1$ and $C_2$ independent
of $h$.

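Let $q : \mathbb{R}^d \times \mathbb{R}_+ \to \mathbb{R}_+$ be given by $q(x, h) = v(\|x\|/h + 1, h)$ for all $(x, h) \in
\mathbb{R}^d \times \mathbb{R}_+$. Note that for fixed $h \in \mathbb{R}_+$, $q(\cdot, h) \in \mathcal{F}(\tau, L)$, $q(x, h) = 0$ if $\|x\| > h$, and
$q(0_d, h) = \sup_{x \in \mathbb{R}^d} q(x, h) \in (C_1 h^{\tau}, C_2 h^{\tau})$.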

Since $r_n (n/\log n)^{\tau/(2\tau+d)} \to 0$, there exists a sequence of positive numbers $\{\psi_n\}_{n=1}^{\infty}$
such that $r_n = \psi_n^{\tau} (\log n/n)^{\tau/(2\tau+d)}$ and $\psi_n \to 0$. Set $h_n = \psi_n (\log n/n)^{1/(2\tau+d)}$. By the
assumption on packing numbers $N(h, S_\vartheta)$, there exists a set $\{X_{j(l)} \in \mathcal{N}_\vartheta : l = 1, \dots, N_n\}$
such that $\|X_{j(l_1)} - X_{j(l_2)}\| > 2h_n$ for $l_1, l_2 = 1, \dots, N_n$ if $l_1 \ne l_2$ and $N_n > C h_n^{-d}$
for some constant $C$. For $l = 1, \dots, N_n$, define function $f^l : \mathbb{R}^d \to \mathbb{R}^p$ given by
$f^l_1(x) = q(x - X_{j(l)}, h_n)$ and $f^l_m(x) = 0$ for all $m = 2, \dots, p$ for all $x \in \mathbb{R}^d$. Note
that functions $\{f^l\}_{l=1}^{N_n}$ have disjoint supports. Moreover, for every $l = 1, \dots, N_n$ and
$m = 1, \dots, p$, $f^l_m \in \mathcal{F}(\tau, L)$. Let $\{\varepsilon_i\}_{i=1}^{n}$ be a sequence of independent standard
Gaussian random vectors $N(0, I_p)$. For $l = 1, \dots, N_n$, define an alternative, $w_l$, with
the regression function $f^l$ and disturbances $\{\varepsilon_i\}_{i=1}^{n}$. Note that $\rho_\vartheta(w_l, H_0) > C r_n$ for
all $l = 1, \dots, N_n$ for some constant $C$. In addition, let $w_0$ denote the alternative with
zero regression function and disturbances $\{\varepsilon_i\}_{i=1}^{n}$.



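As in the proof of lemma 6.2 in Dümbgen and Spokoiny (2001), for any sequence
$\phi_n = \phi_n(Y_1, \dots, Y_n)$ of tests with $\sup_{w \in \mathcal{G}_0} E_w[\phi_n] \le \alpha$,

$$\begin{aligned}
\inf_{w \in \mathcal{G},\ \rho_\vartheta(w, H_0) \ge C r_n} E_w[\phi_n] - \alpha
&\le \min_{l = 1, \dots, N_n} E_{w_l}[\phi_n] - E_{w_0}[\phi_n] \\
&\le \sum_{l=1}^{N_n} E_{w_l}[\phi_n]/N_n - E_{w_0}[\phi_n] \\
&\le E_{w_0}\Big[\Big(\sum_{l=1}^{N_n} (dP_{w_l}/dP_{w_0})/N_n - 1\Big)\phi_n\Big] \\
&\le E_{w_0}\Big[\Big|\sum_{l=1}^{N_n} (dP_{w_l}/dP_{w_0})/N_n - 1\Big|\Big]
\end{aligned}$$

where $dP_{w_l}/dP_{w_0}$ denotes a Radon-Nikodym derivative. For $l = 1, \dots, N_n$, denote
$\omega_l = (\sum_{i=1}^{n} (f^l_1(X_i))^2)^{1/2}$ and $\xi_l = \sum_{i=1}^{n} f^l_1(X_i)\varepsilon_{i,1}/\omega_l$. Then

$$dP_{w_l}/dP_{w_0} = \exp(\omega_l \xi_l - \omega_l^2/2) \qquad \text{(A.71)}$$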
Note that $\omega_l \le C n^{1/2} h_n^{\tau + d/2}$. In addition, under the model $w_0$, $\xi_l$ are independent
standard Gaussian random variables. So, an application of lemma [6] gives

$$E_{w_0}\Big[\Big|\sum_{l=1}^{N_n} (dP_{w_l}/dP_{w_0})/N_n - 1\Big|\Big] \to 0 \qquad \text{(A.72)}$$

if $C n^{1/2} h_n^{\tau + d/2} < C' (\log N_n)^{1/2}$ for some constant $C' \in (0, 1)$ for all large enough $n$.
The result follows by noting that $n^{1/2} h_n^{\tau + d/2} = o(\sqrt{\log n})$ and $\log N_n > C \log n$ for
some constant $C$. □
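
The convergence in (A.72) can be illustrated numerically. The sketch below does not reproduce the lemma used in the proof; it swaps in a simpler second-moment argument. Under $w_0$ the likelihood ratios $\exp(\omega \xi_l - \omega^2/2)$ are independent with mean 1 and variance $e^{\omega^2} - 1$, so by the Cauchy-Schwarz inequality the expected absolute deviation of their average from 1 is at most $\sqrt{(e^{\omega^2} - 1)/N}$. A common value $\omega$ for all alternatives, the function names `l1_bound` and `simulate_l1`, and the constant $0.5$ are assumptions made for illustration.

```python
import math
import random


def l1_bound(num_alternatives, omega):
    """Cauchy-Schwarz bound sqrt((exp(omega^2) - 1) / N) on E|mean(L_l) - 1|."""
    return math.sqrt(math.expm1(omega**2) / num_alternatives)


def simulate_l1(num_alternatives, omega, reps=200, seed=0):
    """Monte Carlo estimate of E|mean_l exp(omega*xi_l - omega^2/2) - 1|,
    with xi_l independent standard Gaussian draws."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(reps):
        avg = sum(
            math.exp(omega * rng.gauss(0.0, 1.0) - omega**2 / 2.0)
            for _ in range(num_alternatives)
        ) / num_alternatives
        total += abs(avg - 1.0)
    return total / reps


if __name__ == "__main__":
    for n_alt in (10, 1000, 10**6):
        # omega = C' * sqrt(log N) with the illustrative choice C' = 0.5 < 1.
        omega = 0.5 * math.sqrt(math.log(n_alt))
        print(n_alt, l1_bound(n_alt, omega))
```

With $\omega = C'\sqrt{\log N}$ the bound equals $\sqrt{(N^{C'^2} - 1)/N}$, which tends to zero exactly when $C' < 1$, in line with the condition stated after (A.72).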

Proof of Corollary 2: 

Proof of Corollary 2. Replace p by K n both in ip n and \ n in all preliminary results 
and theorem [1]. Then all preliminary results except lemma [11] hold for the test with
$K_n \to \infty$. Lemma [11] holds with conditions (iii) and (iv) in the corollary replacing
assumption [7]. So, the first result follows from the same argument as in theorem [1].
For any $w \in \mathcal{G}_\rho$, there exists some $m(w) \in \mathbb{N}$ such that $\sup_{i \in \mathbb{N}} [f_{m(w)}(X_i)]_+ > 0$. Once
$m(w)$ is included in the test statistic, the second result follows as in the proof of
theorem [2]. □